Preface



This volume contains the papers selected for presentation at the Third Pacific- 
Asia Conference on Knowledge Discovery and Data Mining (PAKDD-99) held in 
the Xiangshan Hotel, Beijing, China, April 26-28, 1999. The conference was spon- 
sored by Tsinghua University, National Science Foundation of China, Chinese 
Computer Federation, Toshiba Corporation, and NEC Software Chugoku, Ltd. 

PAKDD-99 provided an international forum for the sharing of original 
research results and practical development experiences among researchers and 
application developers from different KDD-related areas such as machine learn- 
ing, databases, statistics, knowledge acquisition, data visualization, knowledge- 
based systems, soft computing, and high performance computing. It followed the 
success of PAKDD-97 held in Singapore in 1997 and PAKDD-98 held in Aus- 
tralia in 1998 by bringing together participants from universities, industry, and 
government. 

PAKDD-99 encouraged both new theory/methodologies and real-world applications,
and covered broad and diverse topics in data mining and knowledge
discovery. The technical sessions included: Association Rules Mining; Feature 
Selection and Generation; Mining in Semi, Un-structured Data; Interestingness, 
Surprisingness, and Exceptions; Rough Sets, Fuzzy Logic, and Neural Networks; 
Induction, Classification, and Clustering; Causal Model and Graph-Based Meth- 
ods; Visualization; Agent-Based, and Distributed Data Mining; Advanced Topics 
and New Methodologies. 

Of the 158 submissions, we accepted 29 regular papers and 37 short papers 
for presentation at the conference and for publication in this volume. In addition, 
over 20 papers were accepted for poster presentation. 

The PAKDD-99 program was further supplemented by two invited speak- 
ers: Won Kim and Hiroshi Motoda, a special session on Emerging KDD Technol- 
ogy (Speakers: Zdzislaw Pawlak, Philip Yu, T.Y. Lin, Hiroshi Tsukimoto), and a 
panel session on Knowledge Management in Data Mining (Chair: Xindong Wu; 
Panelists: Rao Kotagiri, Zhongzhi Shi, Jan M. Zytkow). 

Two tutorials, Automated Discovery - Combining AI, Statistics and Theory
of Knowledge by Jan M. Zytkow and Quality Data and Effective Mining by
Hongjun Lu, and a workshop on Knowledge Discovery from Advanced Databases
organized by Mohamed Quafafou and Philip Yu, were also offered to all
conference participants on April 26.

A conference such as this can only succeed as a team effort. We would 
like to acknowledge the contribution of the program committee members and 
thank the reviewers for their reviewing efforts, the PAKDD steering committee 
members for their invaluable input and advice, and the conference chairs, Bo
Zhang and Setsuo Ohsuga, whose involvement and support have added greatly to
the quality of the conference. Our sincere gratitude goes to all of the authors who 
submitted papers. We are grateful to our sponsors for their generous support. 
Special thanks are due to Alfred Hofmann of Springer-Verlag for his help and
cooperation. 
KDD as an Enterprise IT Tool: Reality and Agenda



Won Kim 

Cyber Database Solutions, USA 
won.kim@cyberdb.com

KDD is a key technology for harnessing the “business intelligence” IT infras- 
tructure. Although KDD has been a longstanding field of research, it is in its 
infancy in commercial endeavors. Commercial KDD products have many ma- 
jor weaknesses that have slowed KDD’s becoming an enterprise IT tool. In this 
presentation, I will review the technological and business reality of KDD, and 
discuss what needs to happen before KDD can attain the status that other 
enterprise IT tools have reached, including database servers, data warehouses, 
decision support and OLAP tools, etc. 
Computer Assisted Discovery of First Principle
Equations from Numeric Data 



Hiroshi Motoda 

Institute of Scientific and Industrial Research 
Osaka University 

Mihogaoka, Ibaraki, Osaka 567-0047, Japan 
motoda@ar.sanken.osaka-u.ac.jp

Just as physicists have long sought the truth hidden in observed data through
deep insight into the phenomena, a computer can assist in analyzing amounts
of data that are beyond the capability of human cognition and in deriving
equations that explain the phenomena. Being able to reproduce the observed
data does not necessarily mean that the derived equations represent the first
principle. We show that there is a method that ensures that the first
principle is derived. The method is based on the notion of scale types of the
observed data and on interesting properties deduced from the scale-type
constraints and dimensional analysis. These two, together with simple
mathematics, can constrain the form of admissible relations among the
variables in great depth. A complete equation that describes the system
behavior can be represented by a set of dimensionless variables whose
cardinality is smaller than that of the original variables. Each
dimensionless number is related to a subset of the original variables and
forms a chunk called a regime, whose form is constrained by the scale types
of the variables. An algorithm is developed to construct possible regimes in
a bottom-up way, making the best use of the constraints. A number of
different statistical tests are performed to ensure that the correct
relations are identified. The method works for phenomena for which nothing is
known about the number of equations needed to describe them and for which no
knowledge of the dimensions of the variables involved is available. Thus, the
method can be applied to discover models of less well-understood domains such
as biology, psychology, economics and social science. Many of the known first
principles that involve several tens of variables and a few tens of equations
have been rediscovered by numerical experiments with noisy data. The talk
covers our recent advances in this research. This is joint work with Takashi
Washio.

References 

1. T. Washio and H. Motoda. Discovering Admissible Models of Complex Systems 
Based on Scale-Types and Identity Constraints. In Proc. of IJCAI97, pp. 810-817, 
1997. 

2. T. Washio and H. Motoda. Discovering Admissible Simultaneous Equations of 
Large Scale Systems. In Proc. of AAAI98, pp. 189-196, 1998. 

3. T. Washio and H. Motoda. Development of SDS2: Smart Discovery System for 
Simultaneous Equation Systems. In Discovery Science, Lecture Notes in Artificial 
Intelligence 1532, Springer, pp. 352-363, 1998. 
Data Mining - a Rough Set Perspective



Zdzislaw Pawlak 

Institute for Theoretical and Applied Informatics 
Polish Academy of Sciences 
Baltycka 5, 44 000 Gliwice, Poland 



1 Introduction 

Data mining (DM) can be perceived as a methodology for discovering hidden
patterns in data. DM is a relatively new area of research and applications,
stretching over many domains such as statistics, machine learning, fuzzy
sets, rough sets, cluster analysis, genetic algorithms, neural networks and
others. Despite the many different techniques employed in DM, it can be seen
as a distinct discipline with its own problems and aims.

Reasoning methods associated with discovering knowledge from data have
attracted the attention of philosophers for many years. In particular, some
ideas of B. Russell and K. Popper about data, induction and experimental
knowledge can be viewed as precursors of DM.

Many valuable papers and books have been published on data mining 
recently. In this paper we will focus our attention on some problems pertinent 
to rough sets and DM [2, 3, 5, 6, 7, 8, 9, 11, 14, 15, 16, 19, 24, 32, 33, 36, 37]. 

Rough set theory has proved to be useful in DM, and it “... constitutes a
sound basis for data mining applications” [4]. The theory offers mathematical
tools to discover hidden patterns in data. It identifies partial or total
dependencies (i.e. cause-effect relations) in databases, eliminates redundant
data, and offers an approach to null values, missing data, dynamic data and
others. Methods of data mining in large databases using rough sets have
recently been proposed and investigated [5, 14, 16].

The theory is based on a sound mathematical foundation. It can easily be
understood and applied. Several software systems based on rough set theory 
have been implemented and many nontrivial applications of this methodology 
for knowledge discovery have been reported. More about rough sets and their 
applications can be found in [19]. 

The theory is not competitive with but complementary to other methods and
can often be used jointly with other approaches (e.g. fuzzy sets, genetic
algorithms, statistical methods, neural networks etc.).

The main objective of this talk is to give basic ideas of rough sets in the
context of DM. The starting point of rough set theory is a data set. The
theory can also be formulated in more general terms; however, for the sake of
intuition we will refrain from a general formulation. Data are usually
organized in the form of a table, whose columns are labeled by attributes and
rows by objects, and whose entries are attribute values. Such a table will be
called a database. Next, the basic operations on sets in rough set theory,
the lower and the upper approximation of a set, will be defined. These
operations will be used to define the basic concepts of the theory (from the
DM point of view): total and partial dependency of attributes in a database.
The concept of dependency of attributes is used to describe cause-effect
relations hidden in the data. Further, a very important issue, reduction of
data, will be introduced. Finally, certain and possible decision rules
determined by total and partial dependencies will be defined and analyzed. In
addition, certainty and coverage factors of a decision rule will be defined
and reasoning methods based on this idea outlined.

2 Database 

An example of a simple database is presented in Table 1. 



Table 1. An example of a database

Store   E      Q      L     P
1       high   good   no    profit
2       med.   good   no    loss
3       med.   good   no    profit
4       no     avg.   no    loss
5       med.   avg.   yes   loss
6       high   avg.   yes   profit



In the database six stores are characterized by four attributes: 

E - empowerment of sales personnel, 

Q - perceived quality of merchandise, 

L - high traffic location, 

P - store profit or loss. 

Each store is described in terms of attributes E,Q,L and P. 

Each subset of attributes determines a partition (classification) of all
objects into classes having the same description in terms of these
attributes. For example, attributes Q and L aggregate all stores into the
following classes: {1, 2, 3}, {4}, {5, 6}. Thus, each database determines a
family of classification patterns which are used as a basis of further
considerations.

Formally a database will be defined as follows.

By a database we will understand a pair S = (U, A), where U and A are
finite, nonempty sets called the universe and a set of attributes
respectively. With every attribute $a \in A$ we associate a set $V_a$ of its
values, called the domain of a. Any subset B of A determines a binary
relation I(B) on U, which will be called an indiscernibility relation and is
defined as follows:

$(x, y) \in I(B)$ if and only if a(x) = a(y) for every $a \in B$, where a(x)
denotes the value of attribute a for element x.

It can easily be seen that I(B) is an equivalence relation. The family of
all equivalence classes of I(B), i.e. the partition determined by B, will be
denoted by U/I(B) or simply U/B; an equivalence class of I(B), i.e. the block
of the partition U/B containing x, will be denoted by B(x).

If (x, y) belongs to I(B) we will say that x and y are B-indiscernible.
Equivalence classes of the relation I(B) (or blocks of the partition U/B) are
referred to as B-elementary sets or B-granules.
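
As a minimal sketch of the definition above, the partition U/B can be computed by grouping objects on their values for the attributes in B. The dictionary layout of Table 1 and the name `partition` are illustrative assumptions, not part of the original formulation.

```python
from collections import defaultdict

# Table 1, keyed by store number; the dict layout is an illustrative choice.
ROWS = {
    1: ("high", "good", "no", "profit"),
    2: ("med.", "good", "no", "loss"),
    3: ("med.", "good", "no", "profit"),
    4: ("no", "avg.", "no", "loss"),
    5: ("med.", "avg.", "yes", "loss"),
    6: ("high", "avg.", "yes", "profit"),
}
TABLE = {store: dict(zip("EQLP", values)) for store, values in ROWS.items()}


def partition(table, attrs):
    """Return U/B: blocks of objects that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return sorted(blocks.values(), key=min)


print(partition(TABLE, ["Q", "L"]))       # B-granules for B = {Q, L}
print(partition(TABLE, ["E", "Q", "L"]))  # B-granules for B = {E, Q, L}
```

The first call reproduces the classes {1, 2, 3}, {4}, {5, 6} quoted in Section 2; the second shows that stores 2 and 3 are indiscernible with respect to {E, Q, L}.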

For many applications an equivalence relation is not sufficient as a basis
for rough set theory. Therefore other relations, e.g. tolerance relations,
ordering relations and others, have been proposed, e.g. [21, 23, 31]. But for
the sake of simplicity in this paper we will stick to the equivalence
relation as a basis for rough set theory.

3 Approximations of Sets 

First let us consider the following example: what are the characteristic
features of stores having profit (or loss) in view of the information
available in Table 1? It can easily be seen that this question cannot be
answered uniquely, since stores 2 and 3 display the same features in terms of
attributes E, Q and L, but store 2 has a loss, whereas store 3 makes a
profit. In view of the information contained in Table 1, we can say for sure
that stores 1 and 6 make a profit and stores 4 and 5 have a loss, whereas
stores 2 and 3 cannot be classified as making a profit or having a loss.
Employing attributes E, Q and L, we can say that stores 1 and 6 surely make a
profit, i.e. surely belong to the set {1, 3, 6}, whereas stores 1, 2, 3 and 6
possibly make a profit, i.e. possibly belong to the set {1, 3, 6}. We will
say that the set {1, 6} is the lower approximation of the set (concept)
{1, 3, 6} and the set {1, 2, 3, 6} is the upper approximation of the set
{1, 3, 6}. The set {2, 3}, being the difference between the upper
approximation and the lower approximation, is referred to as the boundary
region of the set {1, 3, 6}.

Approximations can be defined formally as operations assigning to every
$X \subseteq U$ two sets $B_*(X)$ and $B^*(X)$, called the B-lower and the
B-upper approximation of X, respectively, and defined as follows:

$$B_*(X) = \bigcup_{x \in U} \{B(x) : B(x) \subseteq X\},$$

$$B^*(X) = \bigcup_{x \in U} \{B(x) : B(x) \cap X \neq \emptyset\}.$$
Hence, the B-lower approximation of a concept is the union of all B-granules
that are included in the concept, whereas the B-upper approximation of a
concept is the union of all B-granules that have a nonempty intersection with
the concept. The set

$$BN_B(X) = B^*(X) - B_*(X)$$

will be referred to as the B-boundary region of X.

If the boundary region of X is the empty set, i.e., $BN_B(X) = \emptyset$,
then X is crisp (exact) with respect to B; in the opposite case, i.e., if
$BN_B(X) \neq \emptyset$, X is referred to as rough (inexact) with respect
to B.
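
The approximations can be checked on Table 1 with a short sketch. It assumes B = {E, Q, L} and the concept X = {1, 3, 6} (stores making a profit); the function names and the table encoding are illustrative only.

```python
from collections import defaultdict

# Table 1, keyed by store number; the dict layout is an illustrative choice.
ROWS = {
    1: ("high", "good", "no", "profit"),
    2: ("med.", "good", "no", "loss"),
    3: ("med.", "good", "no", "profit"),
    4: ("no", "avg.", "no", "loss"),
    5: ("med.", "avg.", "yes", "loss"),
    6: ("high", "avg.", "yes", "profit"),
}
TABLE = {store: dict(zip("EQLP", values)) for store, values in ROWS.items()}


def granules(table, attrs):
    """Blocks of the partition U/B for B = attrs."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())


def lower_approx(table, attrs, X):
    """Union of all B-granules included in X."""
    return set().union(*(g for g in granules(table, attrs) if g <= X))


def upper_approx(table, attrs, X):
    """Union of all B-granules with a nonempty intersection with X."""
    return set().union(*(g for g in granules(table, attrs) if g & X))


B = ["E", "Q", "L"]
X = {1, 3, 6}
LOWER = lower_approx(TABLE, B, X)
UPPER = upper_approx(TABLE, B, X)
BOUNDARY = UPPER - LOWER
print(LOWER, UPPER, BOUNDARY)
```

Because the boundary {2, 3} is nonempty, the concept {1, 3, 6} is rough with respect to {E, Q, L}.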

“Roughness” of a set can also be characterized numerically as

$$\alpha_B(X) = \frac{\mathrm{card}(B_*(X))}{\mathrm{card}(B^*(X))},$$

where $0 \le \alpha_B(X) \le 1$; if $\alpha_B(X) = 1$, X is crisp with
respect to B, whereas if $\alpha_B(X) < 1$, X is rough with respect to B.

Rough sets can also be defined using a rough membership function [17],
defined as

$$\mu_X^B(x) = \frac{\mathrm{card}(B(x) \cap X)}{\mathrm{card}(B(x))}.$$

Obviously

$$0 \le \mu_X^B(x) \le 1.$$

The value of the membership function $\mu_X^B(x)$ is a conditional
probability $\pi(X \mid B(x))$, and can be interpreted as the degree of
certainty to which x belongs to X (or $1 - \mu_X^B(x)$ as the degree of
uncertainty).
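
Both numerical characterizations can be evaluated on Table 1. The sketch below assumes B = {E, Q, L} and X = {1, 3, 6}; exact fractions are used to avoid rounding, and all names are illustrative.

```python
from fractions import Fraction

# Table 1, keyed by store number; the dict layout is an illustrative choice.
ROWS = {
    1: ("high", "good", "no", "profit"),
    2: ("med.", "good", "no", "loss"),
    3: ("med.", "good", "no", "profit"),
    4: ("no", "avg.", "no", "loss"),
    5: ("med.", "avg.", "yes", "loss"),
    6: ("high", "avg.", "yes", "profit"),
}
TABLE = {store: dict(zip("EQLP", values)) for store, values in ROWS.items()}


def block_of(table, attrs, x):
    """B(x): all objects indiscernible from x with respect to attrs."""
    key = tuple(table[x][a] for a in attrs)
    return {y for y, row in table.items() if tuple(row[a] for a in attrs) == key}


def membership(table, attrs, X, x):
    """Rough membership: card(B(x) & X) / card(B(x))."""
    bx = block_of(table, attrs, x)
    return Fraction(len(bx & X), len(bx))


def accuracy(table, attrs, X):
    """card(lower) / card(upper); assumes X is nonempty so upper is nonempty."""
    lower = {x for x in table if block_of(table, attrs, x) <= X}
    upper = {x for x in table if block_of(table, attrs, x) & X}
    return Fraction(len(lower), len(upper))


B = ["E", "Q", "L"]
X = {1, 3, 6}
print(accuracy(TABLE, B, X))
print({x: membership(TABLE, B, X, x) for x in TABLE})
```

Stores 2 and 3 receive membership 1/2, which matches the observation that they cannot be classified with certainty.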



4 Dependency of Attributes 

Another important issue in data analysis is discovering dependencies between 
attributes. Suppose that the set of attributes A in a database S = (U, A) is 
divided into two subsets C and D, called condition and decision attributes 
respectively, such that $C \cup D = A$ and $C \cap D = \emptyset$. Such databases are called
decision tables. 

Intuitively, a set of attributes D depends totally on a set of attributes C,
denoted $C \Rightarrow D$, if all values of attributes from D are uniquely determined
by values of attributes from C. In other words, D depends totally on C, if 
there exists a functional dependency between values of C and D. 

We also need a more general concept of dependency, called a partial
dependency of attributes. Intuitively, partial dependency means that only
some values of D are determined by values of C.

Dependency is closely related to approximations and is a basic issue in
data mining, because it reveals relationships in a database.
Formally, dependency can be defined in the following way. Let C and D
be subsets of A.

We will say that D depends on C to a degree k ($0 \le k \le 1$), denoted
$C \Rightarrow_k D$, if

$$k = \gamma(C, D) = \sum_{X \in U/D} \frac{\mathrm{card}(C_*(X))}{\mathrm{card}(U)}.$$

If k = 1 we say that D depends totally on C, and if k < 1, we say that D
depends partially (to a degree k) on C, and if k = 0, then D does not depend
on C.

The coefficient k expresses the ratio of all elements of the universe which
can be properly classified to blocks of the partition U/D employing
attributes C, and will be called the degree of the dependency.

For example in Table 1 the degree of dependency between the attribute 
P and the set of attributes {E,Q,L} is 2/3. 

Obviously if D depends totally on C then $I(C) \subseteq I(D)$. That means
that the partition generated by C is finer than the partition generated by D.
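
The value 2/3 quoted above can be reproduced with a short sketch. It relies on the fact that an object belongs to the C-lower approximation of some block of U/D exactly when its C-granule is contained in its D-granule; the function name `gamma` and the table encoding are illustrative assumptions.

```python
from fractions import Fraction

# Table 1, keyed by store number; the dict layout is an illustrative choice.
ROWS = {
    1: ("high", "good", "no", "profit"),
    2: ("med.", "good", "no", "loss"),
    3: ("med.", "good", "no", "profit"),
    4: ("no", "avg.", "no", "loss"),
    5: ("med.", "avg.", "yes", "loss"),
    6: ("high", "avg.", "yes", "profit"),
}
TABLE = {store: dict(zip("EQLP", values)) for store, values in ROWS.items()}


def gamma(table, C, D):
    """Degree of dependency of D on C as an exact fraction."""
    def key(x, attrs):
        return tuple(table[x][a] for a in attrs)

    consistent = 0
    for x in table:
        granule = [y for y in table if key(y, C) == key(x, C)]
        # x is counted when every object in its C-granule has the same D-values.
        if all(key(y, D) == key(x, D) for y in granule):
            consistent += 1
    return Fraction(consistent, len(table))


print(gamma(TABLE, ["E", "Q", "L"], ["P"]))
```

Only stores 1, 4, 5 and 6 are classified unambiguously, giving 4/6 = 2/3.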



5 Reduction of Attributes 

A reduct is a minimal set of condition attributes that preserves the degree
of dependency. It means that a reduct is a minimal subset of condition
attributes that enables us to make the same decisions as the whole set of
condition attributes.

Formally, if $C \Rightarrow_k D$ then a minimal subset $C'$ of C such that
$\gamma(C, D) = \gamma(C', D)$ is called a D-reduct of C.

For example, in Table 1 we have two reducts {E,Q} and {E,L} of con- 
dition attributes {E,Q,L}. 

Reduction of attributes is the fundamental issue in rough set theory. 

In large databases computation of reducts on the basis of the given defi- 
nition is not a simple task and therefore many more effective methods have 
been proposed. For references see [19]. 
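
For a table as small as Table 1 the definition can be applied directly by exhaustive search, as in the sketch below. This brute-force enumeration is an illustration only and is exponential in the number of condition attributes; it is not one of the more effective methods referred to above. All names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Table 1, keyed by store number; the dict layout is an illustrative choice.
ROWS = {
    1: ("high", "good", "no", "profit"),
    2: ("med.", "good", "no", "loss"),
    3: ("med.", "good", "no", "profit"),
    4: ("no", "avg.", "no", "loss"),
    5: ("med.", "avg.", "yes", "loss"),
    6: ("high", "avg.", "yes", "profit"),
}
TABLE = {store: dict(zip("EQLP", values)) for store, values in ROWS.items()}


def gamma(table, C, D):
    """Degree of dependency of D on C."""
    def key(x, attrs):
        return tuple(table[x][a] for a in attrs)

    ok = sum(
        all(key(y, D) == key(x, D) for y in table if key(y, C) == key(x, C))
        for x in table
    )
    return Fraction(ok, len(table))


def reducts(table, C, D):
    """Minimal subsets of C that preserve gamma(C, D), found by enumeration."""
    target = gamma(table, C, D)
    found = []
    for size in range(1, len(C) + 1):          # smallest subsets first
        for subset in combinations(C, size):
            if any(set(r) <= set(subset) for r in found):
                continue                       # contains a reduct: not minimal
            if gamma(table, subset, D) == target:
                found.append(subset)
    return found


print(reducts(TABLE, ["E", "Q", "L"], ["P"]))
```

The search returns the two reducts {E, Q} and {E, L} named in the text.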

6 Significance of Attributes 

The concept of a reduct enables us to remove some attributes from the
database in such a way that the basic relationships in the database are
preserved. Some attributes, however, cannot be removed from the database
without changing its properties. To express this idea more precisely we will
need the notion of significance of an attribute, which is defined next.

Let C and D be sets of condition and decision attributes respectively and
let a be a condition attribute, i.e. $a \in C$. We can ask how the
coefficient $\gamma(C, D)$ changes when removing the attribute a, i.e. what
is the difference between $\gamma(C, D)$ and $\gamma(C - \{a\}, D)$. We can
normalize the difference and define the significance of an attribute a as:

$$\sigma_{(C,D)}(a) = \frac{\gamma(C, D) - \gamma(C - \{a\}, D)}{\gamma(C, D)} = 1 - \frac{\gamma(C - \{a\}, D)}{\gamma(C, D)},$$

and denote it simply by $\sigma(a)$ when C and D are understood.

Thus the coefficient $\sigma(a)$ can be understood as an error which occurs
when attribute a is dropped. The significance coefficient can be extended to
sets of attributes as follows:

$$\sigma_{(C,D)}(B) = \frac{\gamma(C, D) - \gamma(C - B, D)}{\gamma(C, D)} = 1 - \frac{\gamma(C - B, D)}{\gamma(C, D)},$$

denoted by $\sigma(B)$ if C and D are understood, where B is a subset of C.

If B is a reduct of C, then $\sigma(B) = 1$, i.e. after removing any reduct
from the set of decision rules one cannot make sure decisions whatsoever.
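
The significance coefficients for Table 1 can be computed with the following sketch, which assumes C = {E, Q, L} and D = {P}. The helper names are illustrative, and the code assumes $\gamma(C, D) > 0$ so that the normalization is defined.

```python
from fractions import Fraction

# Table 1, keyed by store number; the dict layout is an illustrative choice.
ROWS = {
    1: ("high", "good", "no", "profit"),
    2: ("med.", "good", "no", "loss"),
    3: ("med.", "good", "no", "profit"),
    4: ("no", "avg.", "no", "loss"),
    5: ("med.", "avg.", "yes", "loss"),
    6: ("high", "avg.", "yes", "profit"),
}
TABLE = {store: dict(zip("EQLP", values)) for store, values in ROWS.items()}


def gamma(table, C, D):
    """Degree of dependency of D on C."""
    def key(x, attrs):
        return tuple(table[x][a] for a in attrs)

    ok = sum(
        all(key(y, D) == key(x, D) for y in table if key(y, C) == key(x, C))
        for x in table
    )
    return Fraction(ok, len(table))


def significance(table, C, D, B):
    """sigma(B) = 1 - gamma(C - B, D) / gamma(C, D); assumes gamma(C, D) > 0."""
    rest = [a for a in C if a not in B]
    return 1 - gamma(table, rest, D) / gamma(table, C, D)


C, D = ["E", "Q", "L"], ["P"]
for attrs in (["E"], ["Q"], ["L"], ["E", "Q"], ["E", "L"]):
    print(attrs, significance(TABLE, C, D, attrs))
```

In this example E is the only individually significant attribute (3/4), Q and L each have significance 0 because the other can stand in for it, and both reducts have significance 1.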



7 Decision Rules 

Dependencies between attributes are usually symbolized as a set of decision
rules. For example, decision rules describing the dependency
$\{E, Q\} \Rightarrow \{P\}$ in Table 1 are the following:

(E, high) and (Q, good) → (P, profit),

(E, med.) and (Q, good) → (P, loss),

(E, med.) and (Q, good) → (P, profit),

(E, no) and (Q, avg.) → (P, loss),

(E, med.) and (Q, avg.) → (P, loss),

(E, high) and (Q, avg.) → (P, profit).

A set of decision rules is usually referred to as a knowledge base.

Usually we are interested in the optimal set of decision rules associated 
with the dependency, but we will not consider this issue here. Instead we will 
analyze some probabilistic properties of decision rules. 

Let S be a decision table and let C and D be condition and decision
attributes, respectively.

By $\Phi$, $\Psi$, etc. we will denote logical formulas built up from
attributes, attribute-values and logical connectives (and, or, not) in a
standard way. We will denote by $|\Phi|_S$ the set of all objects $x \in U$
satisfying $\Phi$ and refer to it as the meaning of $\Phi$ in S.

The expression $\pi_S(\Phi) = \mathrm{card}(|\Phi|_S) / \mathrm{card}(U)$
denotes the probability that the formula $\Phi$ is true in S.

A decision rule is an expression in the form “if ... then ...”, written
$\Phi \to \Psi$; $\Phi$ and $\Psi$ are referred to as the condition and
decision of the rule, respectively.

A decision rule $\Phi \to \Psi$ is admissible in S if $|\Phi|_S$ is the union
of some C-elementary sets, $|\Psi|_S$ is the union of some D-elementary sets
and $|\Phi \wedge \Psi|_S \neq \emptyset$.

In what follows we will consider admissible decision rules only. 

With every decision rule $\Phi \to \Psi$ we associate a certainty factor

$$\pi_S(\Psi \mid \Phi) = \frac{\mathrm{card}(|\Phi \wedge \Psi|_S)}{\mathrm{card}(|\Phi|_S)},$$

which is the conditional probability that $\Psi$ is true in S, given $\Phi$
is true in S with the probability $\pi_S(\Phi)$.

Besides, we will also need a coverage factor [26]

$$\pi_S(\Phi \mid \Psi) = \frac{\mathrm{card}(|\Phi \wedge \Psi|_S)}{\mathrm{card}(|\Psi|_S)},$$

which is the conditional probability that $\Phi$ is true in S, given $\Psi$
is true in S with the probability $\pi_S(\Psi)$.

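Let $\{\Phi_i \to \Psi\}_n$ be a set of decision rules such that all
conditions $\Phi_i$ are pairwise mutually exclusive, i.e.
$|\Phi_i \wedge \Phi_j|_S = \emptyset$ for any $1 \le i, j \le n$,
$i \neq j$, and

$$\sum_{i=1}^{n} \pi_S(\Phi_i \mid \Psi) = 1. \qquad (1)$$

Then the following properties hold:

$$\pi_S(\Psi) = \sum_{i=1}^{n} \pi_S(\Psi \mid \Phi_i) \cdot \pi_S(\Phi_i), \qquad (2)$$

$$\pi_S(\Phi \mid \Psi) = \frac{\pi_S(\Psi \mid \Phi) \cdot \pi_S(\Phi)}{\sum_{i=1}^{n} \pi_S(\Psi \mid \Phi_i) \cdot \pi_S(\Phi_i)}. \qquad (3)$$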



It can easily be seen that the relationship between the certainty factor
and the coverage factor, expressed by formula (3), is Bayes’ theorem [1].
The theorem enables us to discover relationships in the databases.
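
The certainty and coverage factors, and their relationship, can be checked numerically on Table 1. The sketch below assumes the decision $\Psi$ = (P, profit) and the three conditions over {E, Q} that occur together with it; formulas are restricted to conjunctions encoded as attribute-value dictionaries, which is an illustrative simplification.

```python
from fractions import Fraction

# Table 1, keyed by store number; the dict layout is an illustrative choice.
ROWS = {
    1: ("high", "good", "no", "profit"),
    2: ("med.", "good", "no", "loss"),
    3: ("med.", "good", "no", "profit"),
    4: ("no", "avg.", "no", "loss"),
    5: ("med.", "avg.", "yes", "loss"),
    6: ("high", "avg.", "yes", "profit"),
}
TABLE = {store: dict(zip("EQLP", values)) for store, values in ROWS.items()}


def meaning(table, formula):
    """Objects satisfying a conjunction given as {attribute: value}."""
    return {x for x, row in table.items()
            if all(row[a] == v for a, v in formula.items())}


def prob(table, formula):
    return Fraction(len(meaning(table, formula)), len(table))


def certainty(table, phi, psi):
    both = meaning(table, {**phi, **psi})
    return Fraction(len(both), len(meaning(table, phi)))


def coverage(table, phi, psi):
    both = meaning(table, {**phi, **psi})
    return Fraction(len(both), len(meaning(table, psi)))


PSI = {"P": "profit"}
# Pairwise exclusive conditions of the admissible rules with decision PSI.
CONDITIONS = [
    {"E": "high", "Q": "good"},
    {"E": "med.", "Q": "good"},
    {"E": "high", "Q": "avg."},
]
TOTAL = sum(certainty(TABLE, c, PSI) * prob(TABLE, c) for c in CONDITIONS)
PHI = CONDITIONS[1]
BAYES = certainty(TABLE, PHI, PSI) * prob(TABLE, PHI) / TOTAL
print(certainty(TABLE, PHI, PSI), coverage(TABLE, PHI, PSI), BAYES)
```

For the rule (E, med.) and (Q, good) → (P, profit) the certainty factor is 1/2 and the coverage factor is 1/3, and the right-hand side of formula (3) yields the same 1/3.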



8 Conclusions 

Data mining is the quest for knowledge in databases. Many methods have 
been proposed for knowledge discovery in databases. No doubt rough sets
have proved to be a valuable methodology for data mining. Some advantages of
rough set theory in this context are listed below: 

• provides efficient algorithms for finding hidden patterns in data 

• finds minimal sets of data (data reduction) 

• evaluates significance of data 

• generates minimal sets of decision rules from data 

• is easy to understand and offers a straightforward interpretation of results

The rough set approach to data mining is not competitive with other methods
but rather complementary, and it can also be used jointly with other
approaches.
Data Mining Techniques for Associations, Clustering and Classification

Abstract. This paper provides a survey of various data mining techniques
for advanced database applications. These include association
rule generation, clustering and classification. With the recent increase 
in large online repositories of information, such techniques have great 
importance. The focus is on high dimensional data spaces with large vol- 
umes of data. The paper discusses past research on the topic and also 
studies the corresponding algorithms and applications. 



1 Introduction 

Data mining has recently become an important area of research. The reason 
for this recent interest in the data mining area arises from its applicability to 
a wide variety of problems, including not only databases containing consumer 
and transaction information, but also advanced databases on multimedia, spatial 
and temporal information. In this paper, we will concentrate on discussing a few 
of the important problems in the topic of data mining. These problems include 
those of finding associations, clustering and classification. In this section, we will 
provide a brief introduction to each of these problems and elaborate on them in 
greater detail in later sections. 

(1) Associations: This problem often occurs in the process of finding rela- 
tionships between different attributes in large customer databases. These 
attributes may either be 0-1 literals, or they may be quantitative. The idea 
in the association rule problem is to find the nature of the causalities be- 
tween the values of the different attributes. Consider a supermarket example 
in which the information maintained for the different transactions is the sets 
of items bought by each consumer. In this case, it may be desirable to find 
how the purchase behavior of one item affects the purchase behavior of an- 
other. Association Rules help in finding such relationships accurately. Such 
information may be used in order to make target marketing decisions. It can 
also be generalized to do classification of high dimensional data [21]. 

(2) Clustering: In the clustering problem, we group similar records together in 
a large database of multidimensional records. This creates segments of the 
data which have considerable similarity within a group of points. Depending 
upon the application, each of these segments may be treated differently. 
For example, in image and video databases, clustering can be used to detect
interesting spatial patterns and features and support content-based retrieval
of images and videos using low-level features such as texture, color histogram,
shape descriptions, etc. In insurance applications, the different partitions
may represent the different demographic segments of the population, each of
which has different risk characteristics and may be analyzed separately.
(3) Classification: The classification problem is very closely related to the
clustering problem, and is referred to as supervised learning, as opposed to
the clustering problem, which is referred to as unsupervised learning. In the
classification problem, the attributes are divided into two categories: a
multiplicity of feature attributes, and a single class label. The training
data is used in
order to model the relationship between the feature attributes and the class 
label. This model is used in order to predict the class label of a test example 
in which only the feature attributes are known. Consider for example, the 
previous insurance application in which we have a large training data set in 
which the different records represent the feature values corresponding to the 
demographic behavior of the population, and a class label which represents 
the insurance risk for each such example. The training data is used in order 
to predict the risk for a given set of feature attributes. 

In this paper we will discuss each of the above data mining techniques and 
their applications. In section 2, we will discuss the association rule problem, 
and its application and generalizations for many real problems. In section 3, 
we will discuss the clustering problem. We will also discuss the difficulties in 
clustering for very high dimensional problems, and show how such difficulties 
may be surmounted by using a generalization of the clustering problem, which 
we refer to as projected clustering. In section 4, we discuss the various techniques 
for classification, and its applications. Finally, the conclusion and summary is 
presented in section 5. 

2 Association Rules 

Association rules find the relationships between the different items in a database 
of sales transactions. Such rules track the buying patterns in consumer behavior 
e.g., finding how the presence of one item in the transaction affects the
presence of another, and so forth. The problem of association rule generation has
recently gained considerable importance in the data mining community because 
of the capability of its being used as an important tool for knowledge discovery. 
Consequently, there has been a spurt of research activity in the recent years 
surrounding this problem. 

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of binary literals called items.
Each transaction T is a set of items such that $T \subseteq I$. This
corresponds to the set of items which a consumer may buy in a basket
transaction.

An association rule is a condition of the form $X \Rightarrow Y$ where
$X \subseteq I$ and $Y \subseteq I$ are two sets of items. The idea of an
association rule is to develop a
systematic method by which a user can figure out how to infer the presence 
of some sets of items, given the presence of other items in a transaction. Such 
information is useful in making decisions such as customer targeting, shelving, 
and sales promotions. 

The support of a rule $X \Rightarrow Y$ is the fraction of transactions which
contain both X and Y.

The confidence of a rule $X \Rightarrow Y$ is the fraction of transactions
containing X which also contain Y. Thus, if we say that a rule has 90%
confidence then it means that 90% of the tuples containing X also contain Y.
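
The two measures can be illustrated on a toy basket database. The five transactions and the function names below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

# Hypothetical basket data; each transaction is a set of items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(transactions, X, Y):
    """Fraction of transactions that contain both X and Y."""
    both = X | Y
    return Fraction(sum(both <= t for t in transactions), len(transactions))


def confidence(transactions, X, Y):
    """Among transactions containing X, the fraction that also contain Y."""
    with_x = [t for t in transactions if X <= t]
    return Fraction(sum(Y <= t for t in with_x), len(with_x))


print(support(TRANSACTIONS, {"bread"}, {"milk"}))
print(confidence(TRANSACTIONS, {"bread"}, {"milk"}))
```

Here three of the five baskets contain both bread and milk (support 3/5), and three of the four baskets containing bread also contain milk (confidence 3/4).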

The process of mining association rules is a two phase technique in which 
all large itemsets are determined, and then these large itemsets are used in 
order to find the rules [5]. The large itemset approach is as follows. Generate all 
combinations of items that have fractional transaction support above a certain 
user-defined threshold called minsupport. We call all such combinations large 
itemsets. Given an itemset $S = \{i_1, i_2, \ldots, i_k\}$, we can use it to
generate at most k rules of the type $[S - \{i_r\}] \Rightarrow \{i_r\}$ for
each $r \in \{1, \ldots, k\}$. Once these rules have been generated, only
those rules above a certain user-defined threshold called minconfidence may
be retained.
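
A minimal sketch of the two phases follows. The first phase uses a level-wise candidate search in the spirit of the iterative procedure the text goes on to describe; the toy transactions, the function names `large_itemsets` and `rules`, and the thresholds are illustrative assumptions rather than the implementation of [5].

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical basket data; each transaction is a set of items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def large_itemsets(transactions, minsupport):
    """Phase 1: return {large itemset: support} by level-wise search."""
    n = len(transactions)

    def support(itemset):
        return Fraction(sum(itemset <= t for t in transactions), n)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s) >= minsupport}  # large 1-itemsets
    large, k = {}, 1
    while level:
        large.update((s, support(s)) for s in level)
        # Join: merge pairs of large k-itemsets that share k - 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Validate candidates against the database.
        level = {c for c in candidates if support(c) >= minsupport}
        k += 1
    return large


def rules(large, minconfidence):
    """Phase 2: rules [S - {i}] => {i} whose confidence reaches minconfidence."""
    out = []
    for s, sup in large.items():
        if len(s) < 2:
            continue
        for item in sorted(s):
            antecedent = s - {item}
            conf = sup / large[antecedent]  # subsets of large itemsets are large
            if conf >= minconfidence:
                out.append((set(antecedent), item, conf))
    return out


LARGE = large_itemsets(TRANSACTIONS, Fraction(3, 5))
RULES = rules(LARGE, Fraction(7, 10))
print(LARGE)
print(RULES)
```

With minsupport 3/5 the only large 2-itemset is {bread, milk}, and both rules derived from it have confidence 3/4, so both survive a minconfidence of 7/10.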

In order to generate the large itemsets, an iterative approach is used to
first generate the set of large 1-itemsets $L_1$, then the set of large
2-itemsets $L_2$, and so on until for some value of r the set $L_r$ is empty.
At this stage, the algorithm can be terminated. During the kth iteration of
this procedure a set of candidates $C_k$ is generated by performing a
(k − 2)-join on the large itemsets $L_{k-1}$. The itemsets in this set $C_k$
are candidates for large itemsets, and the final set of large itemsets $L_k$
must be a subset of $C_k$. Each element of $C_k$ needs to be validated
against the transaction database to see if it indeed belongs to $L_k$. The
validation of the candidate itemsets $C_k$ against the transaction database
seems to be the bottleneck operation for the algorithm. This method requires
multiple passes over a transaction database, which may potentially be quite
time consuming. Subsequent work on the large itemset method has concentrated
on the following aspects:

(1) Improving the I/O costs: Brin et al. proposed a method for large itemset
generation which reduces the number of passes over the transaction database
by counting some (k + 1)-itemsets in parallel with counting k-itemsets. In
most previously proposed algorithms for finding large itemsets, the support
for a (k + 1)-itemset was counted after k-itemsets had already been
generated. In this work, it was proposed that one could start counting a
(k + 1)-itemset as soon as it was suspected that this itemset might be large.
Thus, the algorithm could start counting (k + 1)-itemsets much earlier than
completing the counting of k-itemsets. The total number of passes required
by this algorithm is usually much smaller than the maximum size of a large
itemset. A partitioning algorithm was proposed by Savasere et al. [26] for
finding large itemsets by dividing the database into n partitions. The size of
each partition is such that the set of transactions can be maintained in main
memory. Then, large itemsets are generated separately for each partition.
Let $LP_i$ be the set of large itemsets associated with the ith partition.
Then, if an itemset is large, it must belong to at least one of $LP_i$ for
$i \in \{1, \ldots, n\}$. Now, the support of the candidates
$\bigcup_{i=1}^{n} LP_i$ can be counted in order to find the large itemsets.
This method requires just two passes over the transaction database in order
to find the large itemsets.

(2) Improving the computational efficiency of the large itemset
procedure: A hash-based algorithm for efficiently finding large itemsets was
proposed by Park et al. in [24]. It was observed that most of the time
was spent in evaluating and finding large 2-itemsets. The algorithm
in Park et al. [24] attempts to improve this approach by providing a
hash-based algorithm for quickly finding large 2-itemsets. When augmented with
the process of transaction-trimming, this technique results in considerable
computational advantages.

A common feature of most of the algorithms proposed in the literature is that
they are variations on the "bottom-up theme" proposed by
the Apriori algorithm [5,6]. For databases in which the itemsets may be long,
these algorithms may require substantial computational effort. Consider for
example a database in which the length of the longest itemset is 40. In this
case, there are $2^{40}$ subsets of this single itemset, each of which would need
to be validated against the transaction database. Thus, the success of the
above algorithms critically relies on the fact that the length of the frequent
patterns in the database is typically short. An interesting algorithm for
itemset generation has been proposed very recently by Bayardo [7]. This
algorithm uses clever "look-ahead" techniques in order to identify longer
patterns earlier on. The subsets of these patterns can then be pruned from
further consideration. Computational results in [7] indicate that the
algorithm can lead to substantial performance improvements over the Apriori
method.
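
To make the cost of the bottom-up theme concrete, here is a minimal level-wise sketch in Python. It is a simplified illustration rather than the Apriori implementation of [5,6]; the function name `apriori`, the use of an absolute count `min_count`, and the representation of transactions as frozensets are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search: large k-itemsets generate the (k+1)-candidates.

    Returns the large itemsets with their counts and the number of passes
    made over the transaction list.
    """
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    large = {}
    passes = 0
    while level:
        passes += 1  # one pass over the database per level
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        large.update(current)
        # Join pairs of large k-itemsets that differ in exactly one item, and
        # keep a candidate only if all of its k-subsets are large.
        nxt = set()
        for a, b in combinations(list(current), 2):
            u = a | b
            if len(u) == len(a) + 1 and all(u - {x} in current for x in u):
                nxt.add(u)
        level = list(nxt)
    return large, passes
```

A single frequent pattern of length $m$ forces the search to visit all $2^m - 1$ of its non-empty subsets in $m$ passes; for $m = 40$ this is about $1.1 \times 10^{12}$ itemsets.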

(3) Extensions of the large itemset method beyond binary data: Ini- 
tially, the association rule problem was proposed in the context of supermar- 
ket data. The motivation was to find how the items bought in a consumer 
basket related to each other. A number of interesting extensions and appli- 
cations have been proposed. The problem of mining quantitative association 
rules in relational tables was proposed in [30] . In such cases association rules 
are discovered in relational tables which have both categorical and quantita- 
tive attributes. Thus, it is possible to find rules which indicate how a given 
range of quantitative and categorical attributes may affect the values of other 
attributes in the data. The algorithm for the quantitative association rule 
problem discretizes the quantitative data into disjoint ranges and then con- 
structs an item corresponding to each such range. Once these pseudo-items 
have been constructed, a large itemset procedure can be applied in order to 
find the association rules. Often a large number of rules may be produced 
by such partitioning methods, many of which may not be interesting. An 
interest measure was defined and used in [30] in order to generate the asso- 
ciation rules. Variations of the quantitative association rule technique may 
be used in order to generate profile association rules [3]. Such rules relate 
the consumer profile information to their buying behavior. 
An interesting issue is that of handling taxonomies of items. For example, in
a store, there may be several kinds of cereal, and for each individual kind of 
cereal, there may be multiple brands. Rules which handle such taxonomies 
are called generalized associations. The motivation is to generate rules which 
are as general and non-redundant as possible while taking such taxonomies 
into account. Algorithms for finding such rules were presented in [28]. 
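
The discretization step for quantitative attributes can be sketched as follows. This is a hypothetical illustration, not the procedure of [30]: the function name `to_pseudo_items`, the half-open ranges, and the user-supplied cut points in `bin_edges` are all assumptions.

```python
def to_pseudo_items(record, bin_edges):
    """Map one relational record to a set of (attribute, value-or-range) items.

    bin_edges maps each quantitative attribute to a sorted list of cut points;
    attributes absent from bin_edges are treated as categorical.
    """
    items = set()
    for attr, value in record.items():
        if attr in bin_edges:
            edges = bin_edges[attr]
            # Number of cut points at or below the value selects the range.
            idx = sum(1 for e in edges if value >= e)
            lo = edges[idx - 1] if idx > 0 else float("-inf")
            hi = edges[idx] if idx < len(edges) else float("inf")
            items.add((attr, (lo, hi)))  # pseudo-item for the range [lo, hi)
        else:
            items.add((attr, value))
    return frozenset(items)
```

Once every record has been converted in this way, the resulting sets of pseudo-items can be passed to any large itemset procedure as ordinary transactions.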

(4) Algorithms for online generation of association rules: Since the size 
of the transaction database may be very large, the algorithms for finding as- 
sociation rules are both compute-intensive and require substantial I/O. Thus 
it is difficult to provide quick responses to user queries. Methods for online 
generation of association rules have been discussed by Aggarwal and Yu [2]. 
This algorithm uses the preprocess-once-query-many paradigm of OLAP in 
order to generate association rules quickly by using an adjacency lattice to 
prestore itemsets. The interesting feature of this work is that the time needed
to generate the rules is independent of both the size of the transaction data and
the number of itemsets prestored. In fact, the running time of the algorithm
is proportional to the size of the output. It is also possible to
generate queries for rules with specific items in them. In the same work, 
redundancy in association rule generation has been discussed. A rule is said 
to be redundant at a given level of support and confidence if its existence is 
implied by some other rule in the set. For example, consider the following 
pair of rules: {Milk} ⇒ {Bread, Butter} and {Milk, Bread} ⇒ {Butter}. In
this example, the second rule is redundant since its existence is implied by 
the first. Algorithms were proposed to generate a minimal set of essential 
rules for a given set of data. 
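
The redundancy in the example can be checked numerically. Both rules have the same support, namely that of {Milk, Bread, Butter}, and the confidence of the second rule is at least that of the first because the support of {Milk, Bread} cannot exceed the support of {Milk}. The sketch below uses made-up baskets and hypothetical helper names; it is not the algorithm of [2].

```python
def support(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(transactions, lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

# Made-up baskets, for illustration only.
baskets = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk"},
    {"Bread", "Butter"},
]
first = confidence(baskets, {"Milk"}, {"Bread", "Butter"})
second = confidence(baskets, {"Milk", "Bread"}, {"Butter"})
```

Whenever the first rule meets given support and confidence thresholds, the second one does as well, which is why only the first needs to be reported.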

(5) Alternatives to the large itemset model: The large itemset model is 
a useful tool for mining the relationships among the items when the data 
is sparse. Unfortunately, the method is often difficult to generalize to other 
scenarios because of its lack of statistical robustness. Several models have 
been developed in order to take these statistical considerations into account. 
Among these models are included the correlation model [9] and the collective 
strength model [4]. 

3 Clustering 

The clustering problem has been discussed extensively in the database literature 
as a tool for similarity search, customer segmentation, pattern recognition, trend 
analysis and classification. The method has been studied in considerable detail by 
both the statistics and database communities [8,12,13,15,18,19]. Detailed studies 
on clustering methods may be found in [16]. 

The problem of clustering data points can be defined as follows: Given a set 
of points in multidimensional space, find a partition of the points into clusters so 
that the points within each cluster are close to one another. (There may also be 
a group of outlier points.) Some algorithms assume that the number of clusters 
is prespecified as a user parameter. 






Charu C. Aggarwal and Philip S. Yu 



Most clustering algorithms do not work efficiently in higher dimensional 
spaces because of the inherent sparsity of the data[14]. In high dimensional 
applications, it is likely that at least a few dimensions exist for which a given 
pair of points are far apart from one another. So a clustering algorithm is often 
preceded by feature selection (See, for example [20]). The goal is to find the 
particular dimensions for which the points in the data are correlated. Pruning 
away irrelevant dimensions reduces the noise in the data. The problem with
using traditional feature selection algorithms is that picking certain dimensions
in advance can lead to a loss of information. Furthermore, in many real data
examples, some points are correlated with respect to a given set of dimensions 
and others are correlated with respect to different dimensions. Thus it may not 
always be feasible to prune off too many dimensions without at the same time 
incurring a substantial loss of information. We demonstrate this with the help 
of an example. 
Fig. 1. Difficulties Associated with Feature Preselection



In Figure 1, we have illustrated two different projected cross sections for a 
set of points in 3-dimensional space. There are two patterns to the data. The 
first pattern corresponds to a set of points in the x-y plane, which are close in 
distance to one another. The second pattern corresponds to a set of points in 
the x-z plane, which are also close in distance. We would like to have some way 
of discovering such patterns. Feature preselection is not a viable option here, 
since each dimension is relevant to at least one of the clusters. 

In this context, we shall now define what we call a projected cluster. Consider 
a set of data points in some (possibly large) dimensional space. A projected 
cluster is a subset D of dimensions together with a subset C of data points 
such that the points in C are closely clustered in the projected subspace of 
dimensions D. In Figure 1, two clusters exist in two different projected subspaces.
Cluster 1 exists in projected x-y space, while cluster 2 exists in projected x-z 
space. 
We assume that the number k of clusters to be found is an input parameter.
The output of the algorithm will be twofold: 

- a $(k+1)$-way partition $\{C_1, \ldots, C_k, O\}$ of the data, such that the points in
each partition element except the last form a cluster. (The points in the last
partition element are the outliers, which by definition do not cluster well.)

- a possibly different subset $D_i$ of dimensions for each cluster $C_i$,
$1 \le i \le k$, such that the points in the $i$th cluster are correlated with respect to
these dimensions. (The dimensions for the outlier set $O$ can be assumed to
be empty.) The cardinality of each set $D_i$ for the different clusters can be
different.
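
The two-fold output described above can be pictured with a small Python sketch. The data and the spread measure are assumptions chosen for illustration (they mimic the situation of Figure 1) and are not the technique of [1]: a projected cluster is held as a pair (point indices, dimension indices), and the spread of the points is measured only along the chosen dimensions.

```python
def projected_spread(points, members, dims):
    """Average absolute deviation from the centroid, using only `dims`."""
    total = 0.0
    for d in dims:
        centre = sum(points[i][d] for i in members) / len(members)
        total += sum(abs(points[i][d] - centre) for i in members) / len(members)
    return total / len(dims)

# Made-up 3-dimensional points (x, y, z).
points = [
    (1.0, 1.0, 0.0), (1.1, 0.9, 5.0), (0.9, 1.1, 9.0),  # close in x and y
    (5.0, 0.0, 5.0), (5.1, 4.0, 5.1), (4.9, 8.0, 4.9),  # close in x and z
]
cluster_1 = ([0, 1, 2], [0, 1])  # (C_1, D_1): projected x-y space
cluster_2 = ([3, 4, 5], [0, 2])  # (C_2, D_2): projected x-z space
```

Each cluster is tight in its own subspace but spread out in the full space, so dropping either y or z in advance would destroy one of the two clusters.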

Techniques for performing projected clustering have been discussed in [1]. 
One advantage of this technique is that since its output is two-fold in terms of 
reporting both the points and the dimensions, it gives the user a very good idea 
of both the identity and nature of the similarity of the different points in each 
cluster. For example, consider similarity searches in image and video databases 
using precomputed features. The number of features captured can potentially be 
very large. In a marketing application, the information available in the database 
can contain thousands of attributes on customer profile and product purchased. 
In either case, it is very unlikely that similarity can be found for each and every 
feature or attribute in the high dimensional space. However, it is possible to 
segment the data into groups, such that each group is defined by its similarity 
based on a specific set of attributes. Clearly, the representative dimensions for
each cluster are useful information, since they may be used directly for analyzing the
behavior of that cluster.

4 Classification 

The problem of classification has been studied extensively by the database and 
Artificial Intelligence communities. The problem of classification is defined as 
follows: 

The input data is referred to as the training set, which contains a plurality 
of records, each of which contains multiple attributes or features. Each example 
in the training set is tagged with a class label. The class label may either be 
categorical or quantitative. The problem of classification in the context of a 
quantitative class label is referred to as the regression modeling problem. The 
training set is used in order to build a model of the classification attribute based 
upon the other attributes. This model is used in order to predict the value of 
the class label for the test set. 

Some well known techniques for classification include the following: 

(1) Decision Tree Techniques: The idea in decision trees is to recursively
partition the data set until each partition consists mostly of examples from a
particular class [23,27]. Each non-leaf node of the tree contains a split point
which uses some condition to decide how the data set should be partitioned.
For each example in the training data, the split point is used in order to
find how to best partition the data. The performance of the decision tree
depends critically upon how the split point is chosen. The condition which
describes the split point is described as a predicate. Predicates with high
inference power are desirable for building decision trees with better accuracy.
Most decision trees in the literature are based on single attribute predicates.
Recent work by Chen and Yu [11] has discussed how to use multi-attribute
predicates in building decision trees.

(2) k-Nearest Neighbor Techniques: In this case, we find the nearest neighbors 
of the test example and assign its class label based on the majority labels 
of the nearest neighbors. The distributions of the class labels in the nearest 
neighbors may also be used in order to find the relative probabilities for the 
test example to take on different values. Thus, nearest neighbor techniques 
assume that locality in the feature space may often imply strong relationships 
among the class labels. This technique may often lose robustness in very 
high dimensional space, since the data tends to be sparse, and the concept 
of locality is no longer well defined. (In other words, well defined clusters do 
not exist in the original feature space.) 

(3) DNF Rules: The left hand side of a DNF rule consists of a union of a set
of possibly non-disjoint regions in the feature space. The right hand side of
the rule corresponds to the class label. For each test example, we find the
rules whose left hand side contains the particular instance of the example.
More than one rule may be discovered which contains the test example. These
rules may be used in order to predict the final class label. In general, it is
desirable to provide a minimality property while building DNF rules. For a
given test example, it is desirable to decide the label based on a few of the
DNF rules. Thus, it is desirable to cover the feature space with as few rules
as possible without losing the universality of coverage. More details on past
work done on DNF rules may be found in RAMP [10].

(4) Neural Networks: A variety of other methods are known to exist for
classification. These include neural networks [22] and genetic algorithms among
others. A neural network is a data structure constructed of a network of
neurons, which are functions taking in one or more values and returning an
output, which is the class label. The functions in the neurons are defined
based on the weights of the links connecting the nodes. During the training
phase, the records are fed to the neural network one at a time, and the functions in
the neurons are modified based on the error rates of the resulting outputs.
Multiple passes are required over the data in order to train the network. As
a result, the training times are quite large even for small datasets.

(5) Bayesian Classifiers: Bayesian Classifiers [17] fall into a category of
classifiers which are parametric in nature. Parametric classifiers are desirable
for those applications in which the number of feature variables is very
large. For example, text classification is best handled by such classifiers. The
training data is assumed to fit a certain probability distribution in terms
of the values of the features for the different classes. This is used in order
to estimate the parameters of this probability distribution. For a given test
document, the features in it are used in order to determine its class label. In
order to simplify the process of model building, it is often assumed that the
features are statistically independent from one another. This assumption is
far from accurate, and hence such a classifier is referred to as the naive Bayes
Classifier. Such classifiers have been demonstrated to perform surprisingly
well in a very wide variety of problems in spite of the simplistic nature of
the model.
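
Of the techniques listed above, the k-nearest neighbor rule is the simplest to sketch. The following Python fragment is a minimal illustration under stated assumptions: Euclidean distance, a training set held as a list of (feature vector, label) pairs with at least k records, and a hypothetical function name `knn_predict`.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Return the majority label of the k nearest records and the
    distribution of class labels among those neighbors."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    probs = {label: n / k for label, n in votes.items()}
    return votes.most_common(1)[0][0], probs

# Made-up two-dimensional training data.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b")]
label, probs = knn_predict(train, (4, 4), 3)
```

The second return value corresponds to the relative probabilities mentioned in item (2): it reports how the class labels are distributed among the neighbors rather than only the winning label.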



4.1 Applications of Classification 

The classification problem has several direct applications in target marketing
and electronic commerce. A prototypical application of classification is that of
mass mailing for marketing. For example, credit card companies often mail so- 
licitations to consumers. Naturally, they would like to target those consumers 
who are most likely to respond to a mailer. Often demographic information is 
available for those people who have responded before to such solicitations, and 
this information may be used in order to target the most likely respondents. 
Other applications of classification include risk estimation and analysis, fraud 
detection, and credit scoring. 

Other areas of application include medical diagnosis, in which a large set of
features describing the medical history of a subject is used in order to
diagnose the medical condition of the patient, and weather prediction based
on satellite images.

Algorithms for text classification have considerable application in automated 
library organization. Also, many e-commerce applications may be handled using 
text classification, since the subject material from a web page may be inferred 
from the text on it. This subject material may be used in order to make product 
and marketing recommendations. Text databases pose a special challenge to 
classifiers because of the huge number of features in terms of which the data is 
represented. 

5 Conclusions and Summary 

This paper presented a survey of some of the important techniques for data
mining, including association rule mining, clustering, and classification. We
provided an overview of the algorithms for each of these techniques, and their
applications to high dimensional data space.
Abstract. In relational database theory, it is assumed that the universe
U to be represented is a set. Classical data mining makes the same
assumption. In real life applications, the entities are often related. A
"new" data mining theory is explored with such additional semantics.



1 Introduction 

In relational database theory, the universe of entities is represented as a set
of tuples, and each attribute domain is tacitly assumed to be a set. In
other words, we assume there are no relationships among the entities or among the
attribute values of each domain. All data are sets. However, in real
life, the universe, instead of being a set of entities, is a class of objects. These
objects are naturally partitioned (clustered) into clumps of objects which are
drawn together by indistinguishability, similarity or functionality. Briefly,
data are clustered together by some relationships. What is the mathematical
structure of such relationships? In formal logic, a relational structure is assumed.
Intuitively, a clump is a neighborhood (e.g., the nearest neighborhood) of
an object or objects. These relationships can be formulated in terms of binary
relations or neighborhood systems.

Roughly speaking, data mining is just the opposite of database processing.
Database processing is to organize and to store data according to the given
semantics, such as functional dependencies, while data mining is to
discover the semantics of data from the stored bits and bytes. One should point
out that the actual meaning of the bits and bytes exists for humans only; it plays no role
in the processing. The term "knowledge discovery" or "data mining" means
a derivation of properties that are of interest to humans from, and only from,
the formal structures of the stored data. From this aspect, data mining in
relational theory is to derive set theoretical properties of stored data. In this paper,
however, our interests are beyond that; the data mining that we are pursuing is
to derive, in addition, the properties of stored clumps, mathematically the
additional structure imposed by the previously mentioned binary relations. To process
these additional semantics, we use the notion of granular computing.

The idea of granulation is very natural; it is almost everywhere in computer
science and mathematics. In the context of formal theory, it may be traced back

N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 24-|^^ 1999. 

(c) Springer-Verlag Berlin Heidelberg 1999 
to |2I and is outlined in [23 recently. For convenience, this author proposed the
term "granular computing" to label the computational theory of information
granulation.



2 Data Mining on Discrete Data 

Discrete data, mathematically, is a set. This section is a pure set theoretical 
approach to the extensional relational model (p], pp.90) and its data mining. 
A relation is a knowledge representation of a set of entities. A "column" of a
relation induces a partition on the set of entities; an attribute value, then, is a 
label of an equivalence class. 

2.1 Single Attribute Tables and Partitions 

Let $U$ be a set of entities, and $C$ be an attribute domain that consists of
meaningful labels. In database theory, elements in $C$ are the primitive data (attribute
values). However, from the data mining aspect, each element represents a
common property of some entities. As in the example given below, the restaurants
ID1, ID2, ID3 have a common property, namely, they are all located in West
Wood. We will call $C$ the concept space and its elements elementary concepts. Note
that the adjective "meaningful" is for humans; as far as computer systems are
concerned, "meaningful" or not, a label is bits and bytes. We will be interested in
a very simple type of knowledge representation. A map from the entities to
the labels, $P : U \to C$, is called a single attribute representation or simply a
representation. We will assume $P$ is onto, that is, every elementary concept has
an entity mapped onto it; this is reasonable, since there is no point in considering
labels that refer to nothing. Its graph $(u, P(u))$ is called a single attribute table;
"physically" the table has two columns. Since the map and its graph determine
each other, we will use the two terms interchangeably.

The representation $P$ induces a partition, denoted by $P$ again, on $U$. It is
clear that $P$ can be factored through the quotient set:

$P : U \to U/P \to C$.

The first map is called the natural projection, the second the naming map. An 
elementary set plays two roles: one as an element of the quotient set, another 
as a subset of U. We can regard the first role as the canonical name of the 
second. 

Conversely, suppose we are given an equivalence relation $P$ and each
elementary set is given its canonical name. Then, we have a map from the entities to
the canonical names

$u \to [u] \to NAME([u])$,

where, as usual, $[u]$ denotes the elementary set that contains $u$. The map and
its graph are called a canonical single attribute representation and a canonical single
attribute table, respectively.
Proposition 1.

1. There is a one-to-one correspondence between canonical single attribute ta- 
bles and partitions (equivalence relations). 

2. The relationships among attribute values (elementary concepts) are defined 
by the corresponding elementary sets. 

In the following example, we can see from the meaningful labels that some
restaurants are located in West Wood, West LA, and Brent Wood (these symbols mean
something to humans), but no such information is conveyed to humans when canonical names are
used. For computer systems, either choice gives rise to the same mathematical
structures of the stored data.

Example. Let $U = \{ID1, ID2, \ldots, ID7\}$ be a set of 7 restaurants. Assume $U$ is
partitioned into elementary sets by their location. We give these elementary sets
meaningful names (locations) and canonical names. For comparison, we combine
two single attribute representations into Table 1.



Restaurants   Locations     Restaurants   Canonical names
ID1           West Wood     ID1           {ID1, ID2, ID3}
ID2           West Wood     ID2           {ID1, ID2, ID3}
ID3           West Wood     ID3           {ID1, ID2, ID3}
ID4           West LA       ID4           {ID4, ID5}
ID5           West LA       ID5           {ID4, ID5}
ID6           Brent Wood    ID6           {ID6, ID7}
ID7           Brent Wood    ID7           {ID6, ID7}

Table 1. Two Single Attribute Tables
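
The passage from a single attribute representation to its canonical table can be sketched in a few lines of Python. The function names and the dictionary representation of the map $P$ are assumptions made for illustration; the data are those of the restaurant example.

```python
def induced_partition(representation):
    """Group entities by label: the partition induced by P : U -> C."""
    blocks = {}
    for entity, label in representation.items():
        blocks.setdefault(label, set()).add(entity)
    return {label: frozenset(block) for label, block in blocks.items()}

def canonical_table(representation):
    """Replace each meaningful label by its elementary set (canonical name)."""
    blocks = induced_partition(representation)
    return {entity: blocks[label] for entity, label in representation.items()}

location = {
    "ID1": "West Wood", "ID2": "West Wood", "ID3": "West Wood",
    "ID4": "West LA", "ID5": "West LA",
    "ID6": "Brent Wood", "ID7": "Brent Wood",
}
canonical = canonical_table(location)
```

Both tables induce the same partition of the seven restaurants, which is the sense in which the choice of names makes no difference to a computer system.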



2.2 Multiple Attributes and Multiple Partitions 

Now, it is easy to generalize the idea to multiple attributes. A multiple attribute 
representation is simply a join of single attribute representations. Its graph is 
called a multiple attribute table or simply a table. It is easy to see that a table is
equivalent to an information table. We will use these terms interchangeably. Pawlak
called the pair consisting of the universe and a multiple partition a knowledge base. We
have called it rough structure; see Section o 

Proposition 2.

1. There is a one-to-one correspondence between tables and multiple partitions
(knowledge bases; rough structures).

2. The relationships among attribute values (elementary concepts) of a table
are defined by the elementary sets of a multiple partition.

For example, inclusion of an elementary set in another elementary set is an 
inference on the corresponding elementary concepts; a functional dependency of 
two columns is a refinement of two corresponding partitions. 
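
The remark on functional dependencies can be checked mechanically: a column X functionally determines a column Y exactly when the partition induced by X refines the partition induced by Y. The sketch below is an illustration only; the function name and the second attribute (a made-up "region" column) are assumptions.

```python
def is_refinement(fine, coarse):
    """True if every block of `fine` lies inside some block of `coarse`."""
    return all(any(block <= other for other in coarse) for block in fine)

# Partition of the seven restaurants by location (as in Table 1).
by_location = [{1, 2, 3}, {4, 5}, {6, 7}]
# Hypothetical second column: West Wood and West LA share one region.
by_region = [{1, 2, 3, 4, 5}, {6, 7}]
```

Here location determines region, since each location block sits inside one region block, but region does not determine location.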

2.3 Partition Coarsening and Concept Hierarchy 

Consider a sequence $PN$ of coarsening partitions:

$PN^n$, $n = 1, 2, \ldots$; $PN^1 = P$, and $PN^n \preceq PN^{n+1}$,

where $\preceq$ denotes refinement of partitions. In other words, each elementary set of $PN^n$
is contained in some elementary set of $PN^{n+1}$, or conversely each elementary set
of $PN^{n+1}$ is a union of elementary sets of $PN^n$. Note that the latter property
(but not the former) will be generalized to clustered data below.

We would like to stress that, if two elementary sets are identical, then their
names should be the same. The universe and the total collection of the nested
partitions, $(U; PN, QN, RN, \ldots)$, is called a high level knowledge base or high
level rough structure; see Section 3.1. The concept spaces, such as $NAME(PN)$,
$NAME(QN)$, ..., are called concept hierarchies.

2.4 Data Mining on Multiple Partitions 

It is easy to see that a relation in a relational database is a multiple attribute
representation,

$\wedge_i P^i : U \to \times_i NAME(P^i)$,

where $U$ is the set of entities, and $P^i$, $i = 1, 2, \ldots$, are the attributes. Let $E^i_j$ be
the elementary sets of $P^i$ and $C^i_j = NAME(E^i_j)$. For readability, we rename the
partitions by $P = P^1$, $Q = P^2$, $C = C^1$, $D = C^2$, ..., and the elementary sets and
concepts by $P_j = E^1_j$, $Q_h = E^2_h$, $C_j = C^1_j$ and $D_h = C^2_h$. From Proposition 2
we have the following "theorem."

"Theorem" 1. Automated data mining of a given relation is to derive interesting
properties of elementary sets of the corresponding high level rough structure
(knowledge base).



Corollary.

1. Association rules: A pair $(C_j, D_h)$ is an association rule if $|P_j \cap Q_h| >$
threshold.

2. Inference rules: A formula $C_j \to D_h$ is an inference rule if $P_j \subseteq Q_h$.

3. Robust inference rules: A formula $C_j \to D_h$ is a robust inference rule if $P_j \subseteq Q_h$
and $|P_j \cap Q_h| >$ threshold.

4. High level (robust) inference rules: Write $P = PN^i$ and $Q = QN^k$, where
$k > i$. A formula $C_j \to D_h$ is a high level inference rule if $P_j \subseteq Q_h$ (and
$|P_j \cap Q_h| >$ threshold) [8,9].

5. Soft (robust) inference rules: Write P = PN^ and Q = QN^ where and 

fc > 1. A formula Cj Dh is a soft (robust) inference rule, if Pj C Qh (and 

I Pj n Qh |> threshhold) [Itilhj 
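
The first three items of the corollary translate directly into set operations. The sketch below is a hedged illustration: the function names, the strict comparison with the threshold, and the second attribute (a made-up price column on the seven restaurants) are assumptions. Each partition is held as a dictionary from concept name to elementary set.

```python
def association_pairs(P, Q, threshold):
    """Pairs (C_j, D_h) whose elementary sets share more than `threshold` entities."""
    return {(c, d) for c, Pj in P.items() for d, Qh in Q.items()
            if len(Pj & Qh) > threshold}

def inference_rules(P, Q, threshold=0):
    """Rules C_j -> D_h with P_j contained in Q_h; a positive threshold
    keeps only the robust ones."""
    return {(c, d) for c, Pj in P.items() for d, Qh in Q.items()
            if Pj <= Qh and len(Pj & Qh) > threshold}

P = {"West Wood": {1, 2, 3}, "West LA": {4, 5}, "Brent Wood": {6, 7}}
Q = {"low": {1, 2, 3, 4}, "high": {5, 6, 7}}  # hypothetical price column
```

With these data, West LA is split between the two price classes, so it takes part in neither an inference rule nor an association pair at threshold 1.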

3 Granulations and Mining Clustered Data 

We have shown that data mining in relational theory is deriving properties from 
a set of partitions (equivalence relations). Now we will show that data mining 
(of real life data) is deriving properties from a set of granulations (binary
relations).

3.1 Granular Structures for Clustered Data 

An equivalence relation partitions the universe into disjoint elementary sets. A
binary relation decomposes the universe into elementary neighborhoods that are
not necessarily disjoint. The decomposition is called a binary granulation and
the collection of the granules a binary neighborhood system. We will recall
a few relevant notions.

1. Binary relations: $B$ is a binary relation on $U$ if $B \subseteq U \times U$. It defines a

2. Binary neighborhood system, namely, a map

$N : p \to B_p = \{u \mid (p, u) \in B\}$,

where $B_p$ is called an elementary neighborhood. We will use $B$ to denote the
collection $\{B_p \mid \forall p \in U\}$, the map $N$ or the binary relation $B$ and refer to
every one of them as a binary neighborhood system. If the binary relation
is an equivalence relation, elementary neighborhoods are elementary sets.
Conversely, we can define a binary relation by elementary neighborhoods:

$B = \{(p, u) \mid p \in U \text{ and } u \in B_p\}$.

3. A subset $A \subseteq U$ is a definable neighborhood/set if $A$ is a union of
elementary neighborhoods/sets.

4. A space with a neighborhood system is called an NS-space; it is a mild
generalization of a Frechet(U) space [19].

5. A binary granular structure or simply granular structure consists of a 3-tuple
$(U, B, C)$, where (1) $U$ is an NS-space imposed by $B$ (see Item 4), (2) $B$ is
a set of binary neighborhood systems $B^i$, $i = 1, 2, \ldots$, and (3) $C$ is a set of
concept spaces $C^i = NAME(B^i)$, $i = 1, 2, \ldots$; each $C^i$ is an NS-space (see
the next item). If $B$ is an equivalence relation, the granular structure is called
a rough structure. A rough structure is equivalent to a knowledge base and
a granular structure is equivalent to a binary knowledge base.

6. $B^i$ induces a binary relation on the concept space $C^i$ as follows: We will
suppress the index $i$ by writing $C = C^i$ and $B = B^i$. Two concepts, $C_h$ and
$C_k$, are $B$-related iff there exist $e_h \in C_h$ and $e_k \in C_k$ such that $(e_h, e_k) \in B$.
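
The notions recalled above can be made concrete with a short Python sketch. It is only an illustration: the function names and the small relation are assumptions, and a binary relation is held as a set of ordered pairs.

```python
def neighborhoods(universe, relation):
    """Elementary neighborhoods B_p = {u | (p, u) in B} for every p in U."""
    return {p: frozenset(u for (q, u) in relation if q == p) for p in universe}

def is_definable(subset, system):
    """True if `subset` is a union of elementary neighborhoods."""
    inside = [n for n in system.values() if n <= subset]
    return frozenset().union(*inside) == frozenset(subset)

U = {1, 2, 3}
B = {(1, 1), (1, 2), (2, 2), (3, 3)}  # made-up relation, not an equivalence
system = neighborhoods(U, B)
```

Unlike the blocks of a partition, the neighborhoods here overlap ($B_1$ contains $B_2$), so some subsets of $U$, such as {1}, are not definable.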
3.2 Table Representations of Granular Structures

Let $(U, \{B^i, C^i, i = 1, 2, \ldots, n\})$ be a binary granular structure. For a fixed $i$,
we write $B = B^i$, $C = C^i$. We should note here that $U$ and $C$ are NS-spaces.
Let us consider the following map

$GT : p \to B_p \to NAME(B_p)$.

This map is continuous in the sense of NS-spaces; we will call it a single
attribute granular representation. Its graph $(p, GT(p))$ is a single attribute
granular table. As before, we will call it a canonical single attribute granular table or
simply a canonical granular table, if we use the canonical names. A join of such
single column tables forms a multiple attribute granular representation. The graph
of this map will be called a granular table; it is equivalent to an extended information
table. If the binary granular structure is a rough structure, then the granular
table reduces to an information table. We would like to caution readers that, unlike
the case of discrete data, the entries in the granular table are not semantically
independent; there are relationships among elementary concepts (attribute
values).

Proposition 3.

1. There is a one-to-one correspondence between canonical granular tables
and granular structures (binary knowledge bases).

2. The relationships among attribute values (elementary concepts) of a multiple
attribute granular table are defined by the elementary neighborhoods of the
corresponding granular structure.

Examples

Example. Let $U$ be the set of restaurants given in Table 1. $U$ has the granular structures
described below.

We will suppress the ID from IDi, so the set of restaurants is $U = \{1, 2, 3, 4, 5, 6, 7\}$.

1. $B$ is a binary neighborhood system defined by

$B_1 = \{3, 4, 5, 6, 7\}$, $B_2 = \{3, 4, 5, 6, 7\}$; $B_3 = \{1, 2, 3\}$;
$B_4 = B_5 = B_6 = B_7 = \{4, 5, 6, 7\}$.

The elementary concepts of $B$ are:

$NAME(B_1) = NAME(B_2) =$ heavy, $NAME(B_3) =$ middle,
$NAME(B_4) = NAME(B_5) = NAME(B_6) = NAME(B_7) =$ light.

The $B$-concept space is an NS-space whose binary relation is described in Table 3.

2. $E$ is an equivalence relation defined by

$E_1 = E_2 = E_3 = \{1, 2, 3\}$; $E_4 = E_5 = E_6 = E_7 = \{4, 5, 6, 7\}$.

The elementary concepts of $E$ are:

$NAME(E_1) = NAME(E_2) = NAME(E_3) =$ low,
$NAME(E_4) = NAME(E_5) = NAME(E_6) = NAME(E_7) =$ high.

The $E$-concept space is discrete; there is no binary relation on it.



Objects   (U, B, C)-attribute   (U, E, C)-attribute
ID_1      heavy                 low
ID_2      heavy                 low
ID_3      middle                low
ID_4      light                 high
ID_5      light                 high
ID_6      light                 high
ID_7      light                 high

Table 2. Binary Granulations; entries are semantically interrelated 



(U, B, C)-attribute   (U, B, C)-attribute
heavy                 light
heavy                 middle
middle                heavy
middle                middle
light                 light

Table 3. Semantic Relation on (U, B, C)-attribute 
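
The relation in Table 3 can be recomputed from the neighborhood system of the example. The following is a minimal sketch, assuming that a concept is related to another concept whenever some object named by the first has an object named by the second in its elementary neighborhood; the function name `concept_relation` and the dictionary layout are illustrative choices, not notation from the paper.

```python
# Binary neighborhood system B of the restaurant example: B[p] is the
# elementary neighborhood of object p.
B = {
    1: {3, 4, 5, 6, 7}, 2: {3, 4, 5, 6, 7}, 3: {1, 2, 3},
    4: {4, 5, 6, 7}, 5: {4, 5, 6, 7}, 6: {4, 5, 6, 7}, 7: {4, 5, 6, 7},
}
# NAME[p] is the elementary concept naming the neighborhood B[p].
NAME = {1: "heavy", 2: "heavy", 3: "middle",
        4: "light", 5: "light", 6: "light", 7: "light"}


def concept_relation(neigh, name):
    """Return all pairs (name of p, name of q) with q in the neighborhood of p."""
    return {(name[p], name[q]) for p, members in neigh.items() for q in members}


relation = concept_relation(B, NAME)
for pair in sorted(relation):
    print(pair)
```

Under this assumption the computed pairs coincide with the five rows of Table 3, which illustrates why the entries of the (U, B, C)-attribute are not semantically independent.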



3.3 Nested Granulations and Concept Hierarchies 

The concepts of nested partitions (equivalence relations) form a concept hierar- 
chy. In this section, we relax the equivalence relation to a general binary relation 

113 - 



Let and (U,B^,C^) be two binary granular structures. 

1. B^ is strongly depended on B^ , denoted by B^ B^ , iff every R^-neighbor- 
hood is a definable R^-neighborhood that is, every R^-neighborhood is a 
union of -neighborhood. If B^ B^, we will say B^ is definably finer 
than B^ or B^ is definably coarser than B^ . 
2. is definably more general than on denoted by A 

A concept hierarchy is a nested sequence $(U; BN^i, CN^i, i = 0, 1, 2, \ldots)$ of strongly 
depended binary granular structures, more precisely, 

1. $BN^i_j$ is a level-$i$ elementary neighborhood, $j = 1, 2, \ldots$ 

2. $CN^i_j = NAME(BN^i_j)$ is a level-$i$ elementary concept, $j = 1, 2, \ldots$ 

3. BN^ 

4. CN^ A 

Such a nested granular structure, $(U; BN^i, CN^i, i = 0, 1, 2, \ldots)$, is called a nested 
high level knowledge base or nested high level granular structure. The sequence 
of concept spaces, $CN^i = NAME(BN^i)$, $i = 0, 1, 2, \ldots$, is also called a concept 
hierarchy. 

3.4 Mining on Multiple Granular Structures 

A relation in real life databases is a granular table. 

“Theorem” 2. Automated data mining of a relation (in a real life database) 
is to derive interesting properties among the elementary neighborhoods of the 
corresponding high level granular structure (high level binary knowledge base). 

Let $(U; B^i, C^i, i = 0, 1, 2, \ldots)$ be a collection of granular structures. Let $B^i_j$ be 
elementary neighborhoods, and $C^i_j = NAME(B^i_j)$ be elementary concepts. For 
readability, we rename the granulations by $P = B^i$, $Q = B^j$, $C = C^i$, $D = C^j$, 
and elementary neighborhoods and concepts by $P_j = B^i_j$, $Q_h = B^j_h$, $C_j = C^i_j$ 
and $D_h = C^j_h$. We have used $N_p$ to denote a neighborhood of $p$; we will use 
$NEIGH(p)$ in this section. 

Main Theorem. 

1. Soft association rules: A pair $(C_j, D_h)$ is a soft association rule, if 
$|NEIGH(P_j) \cap NEIGH(Q_h)| \geq$ threshold. 

2. Continuous inference rules: A formula $C_j \to D_h$ is a continuous inference 
rule, if $NEIGH(P_j) \subseteq NEIGH(Q_h)$. 

3. Softly robust continuous inference rules: A formula $C_j \to D_h$ is a softly 
robust continuous inference rule, if $NEIGH(P_j) \subseteq NEIGH(Q_h)$ and 
$|NEIGH(P_j) \cap NEIGH(Q_h)| \geq$ threshold. 

4. (Softly robust) High level continuous inference rules: Suppose $PN = B^i$, 
$QN = B^j$, and $j \neq i$ are two nested granular structures, that is, $PN^i \prec 
PN^{(i+k)}$ and $QN^i \prec QN^{(i+k)}$. Write $P = PN^m$ and $Q = QN^n$, where 
$n > m$ and $k > 0$. A formula $C_j \to D_h$ is a (softly robust) high level 
continuous inference rule, if $NEIGH(P_j) \subseteq NEIGH(Q_h)$ (and 
$|NEIGH(P_j) \cap NEIGH(Q_h)| \geq$ threshold). 

5. Soft (softly robust) continuous inference rules: Write $P = PN^k$ and $Q = 
QN^k$, where $k > 1$. A formula $C_j \to D_h$ is a soft (softly robust) 
continuous inference rule, if $NEIGH(P_j) \subseteq NEIGH(Q_h)$ (and 
$|NEIGH(P_j) \cap NEIGH(Q_h)| \geq$ threshold). 
These results are immediate from the definitions; we label them as a theorem to stress 
that they are conceptually important. We can also easily define many variants; we shall 
not do so here. 
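
The first three rule types can be checked directly with set operations. The sketch below applies them to the two granulations of the restaurant example; the threshold value of 3 and the function names are assumptions made purely for illustration.

```python
def soft_association(neigh_p, neigh_q, threshold):
    """Item 1: the two neighborhoods share at least `threshold` objects."""
    return len(neigh_p & neigh_q) >= threshold


def continuous_inference(neigh_p, neigh_q):
    """Item 2: NEIGH(P_j) is contained in NEIGH(Q_h)."""
    return neigh_p <= neigh_q


def softly_robust(neigh_p, neigh_q, threshold):
    """Item 3: containment together with enough shared objects."""
    return (continuous_inference(neigh_p, neigh_q)
            and soft_association(neigh_p, neigh_q, threshold))


# Elementary neighborhoods of the example, keyed by their concept names.
P = {"heavy": {3, 4, 5, 6, 7}, "middle": {1, 2, 3}, "light": {4, 5, 6, 7}}
Q = {"low": {1, 2, 3}, "high": {4, 5, 6, 7}}
THRESHOLD = 3  # hypothetical value, not taken from the paper

rules = {
    (c, d): (soft_association(p, q, THRESHOLD), continuous_inference(p, q))
    for c, p in P.items()
    for d, q in Q.items()
}
for (c, d), (soft, cont) in rules.items():
    print(f"{c} -> {d}: soft association={soft}, continuous inference={cont}")
```

With these inputs, heavy and high form a soft association rule without being a continuous inference rule, whereas middle with low and light with high satisfy both conditions.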



4 Conclusion 



Data mining is essentially a “reverse” engineering of database processing. The 
latter organizes and stores data according to the given semantics, while the 
former is “discovering” the semantics of stored data. So automated data mining 
is a process of deriving interesting (to human) properties from the underlying 
mathematical structure of the stored bits and bytes. 

This paper focuses on relational databases that have additional semantics, 
e.g., [1411314181511 7| . The underlying mathematical structure of such a relation 
is a set of binary relations or a granular structure. Data mining is, then, a 
processing of the granular structure, that is, granular computing. If there is no additional 
semantics, then the binary relations are equivalence relations and granular com- 
puting reduces to rough set theory. 

Abstract. Knowledge Discovery in Databases (KDD) should provide 
not only predictions but also knowledge such as rules comprehensible 
to humans. That is, KDD has two requirements, accurate predictions 
and comprehensible rules. The major KDD techniques are neural net- 
works, statistics, decision trees, and association rules. Prediction models 
such as neural networks and multiple regression formulas cannot pro- 
vide comprehensible rules. Linguistic rules such as decision trees and 
association rules cannot work well when classes are continuous. There- 
fore, there is no perfect KDD technique. Rule extraction from prediction 
models is needed for perfect KDD techniques, which satisfy the two KDD 
requirements, accurate predictions and comprehensible rules. Several re- 
searchers have been developing techniques for rule extraction from neural 
networks. The author also has been developing techniques for rule extrac- 
tion from prediction models. This paper briefly explains the techniques 
of rule extraction from prediction models. 



1 Introduction 

Knowledge Discovery in Databases (KDD) means the discovery of knowledge 
from (a large amount of) data. KDD should therefore provide not only predictions 
but also knowledge such as rules comprehensible to humans. That is, 
KDD techniques should satisfy two requirements: accurate predictions 
and comprehensible rules. 

KDD consists of several processes such as preprocessing, learning and so 
on. This paper deals with learning. The learning in KDD can be divided into 
supervised learning and unsupervised learning. This paper deals with supervised 
learning, that is, classification and regression. 

The major KDD techniques are neural networks, statistics, decision trees, 
and association rules. When these techniques are applied to real data, which 
usually consist of discrete data and continuous data, they each have their own 
problems. In other words, there is no perfect technique, that is, a technique which 
can satisfy the two requirements, accurate predictions and comprehensible rules. 

Neural networks are black boxes, that is, neural networks are incomprehen- 
sible. Multiple regression formulas, which are the typical statistical models, are 
black boxes too. Decision trees do not work well when classes are continuous [11]: 
if accurate predictions are desired, comprehensibility must be sacrificed, 
and if comprehensibility is desired, accurate predictions must be sacrificed. 
Association rules, which are unsupervised learning techniques, do not work well 
when the right-hand sides of rules, which can be regarded as classes, are 
continuous [14]. The reason is almost the same as that for decision trees. 

Neural networks and multiple regression formulas are prediction models. De- 
cision trees and association rules are linguistic rules. Prediction models can pro- 
vide predictions but cannot provide comprehensible rules. On the other hand, 
linguistic rules can provide comprehensible rules but cannot provide accurate 
predictions in continuous classes. 

How can we solve the above problem? The solution is rule extraction from 
prediction models. Rule extraction from prediction models is needed for develop- 
ing the perfect KDD techniques, that is, satisfying the two KDD requirements, 
accurate predictions and comprehensible rules. 

Several researchers have been developing techniques for rule extraction from 
neural networks, so neural networks can be used in KDD [7], whereas few 
researchers have studied rule extraction from linear formulas. We have 
developed rule extraction techniques for prediction models such as neural networks, 
linear formulas and so on. 

Section 2 explains the problems in the major KDD techniques and how the 
problems can be solved by rule extraction from prediction models. Section 3 
briefly surveys rule extraction from neural networks. Section 4 explains the tech- 
niques developed by the author. Section 5 briefly surveys open problems. 



2 The Problems of Major KDD Techniques 

The major KDD techniques, that is, decision trees, neural networks, statistics 
and association rules are reviewed in terms of the two requirements for KDD 
techniques, accurate predictions and comprehensible rules. 



2.1 Neural Networks 

Neural networks can provide accurate predictions in the discrete domain and the 
continuous domain. The problem is that the training results of neural networks 
are sets of mathematical formulas, that is, neural networks are incomprehensible 
black boxes. 



2.2 Multiple Regression Analysis 

There are many statistical methods. The most typical is multiple 
regression analysis, so only multiple regression analysis is discussed here. 
Multiple regression analysis usually uses linear formulas, so it can perform 
only linear regression, whereas neural networks can also perform nonlinear 
regression. However, linear regression analysis has the following advantages. 

1. The optimal solution can be calculated. 

2. Linear regression analysis is the most widely used method in the world. 

3. Several regression analysis techniques such as nonparametric regression anal- 
ysis, multivariate autoregression analysis, and so on have been developed 
based on the linear regression analysis. 

Linear regression analysis can provide appropriate predictions in the continuous 
domain and discrete domain. The problem is that multiple regression formulas 
are mathematical formulas, which are incomprehensible black boxes too. There- 
fore, rule extraction from linear regression formulas is important. 

2.3 Decision Trees 

When a class is continuous, the class is discretized into several intervals. When 
the number of the intervals, that is, the number of the discretized classes, is 
small, comprehensible trees can be obtained, but the tree cannot provide accu- 
rate predictions. For example, the figure on the left side in Fig. 1 shows a tree 
where there are two continuous attributes and a continuous class. To improve 
the prediction accuracy, let the number of intervals be large. When the number 
of intervals is large, the trees obtained are too complicated to be comprehensible. 

From the above simple discussion, we can conclude that it is impossible to 
obtain trees which can satisfy the two requirements for KDD techniques, accurate 
predictions and comprehensible rules at the same time. Therefore, decision trees 
cannot work well when classes are continuous [11]. 
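
The trade-off can be made concrete with a small calculation. The sketch below assumes equal-width intervals and a tree that predicts the midpoint of the chosen interval; both are simplifying assumptions introduced here, and the function name is hypothetical.

```python
def midpoint_prediction_error(lo, hi, k):
    """Worst-case absolute error when a class value in [lo, hi] is replaced
    by the midpoint of one of k equal-width intervals."""
    return (hi - lo) / (2 * k)


# More intervals reduce the error, but every interval is one more
# discretized class that the tree has to distinguish.
for k in (2, 4, 16, 64):
    print(k, midpoint_prediction_error(0.0, 1.0, k))
```

Halving the worst-case error requires doubling the number of discretized classes, which is the source of the growth in tree complexity described above.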

As a solution for continuous classes, for example, Quinlan presented Cubist. 
Cubist generates piecewise-linear models [12], which are a kind of regression 
trees. The figure on the right side in Fig. 1 shows an example. As seen from this 
figure, the tree is an extension of linear formulas, so the tree is a prediction model, 
that is, incomprehensible. As a result, this solution has solved the inaccurate 
prediction problem but has generated the incomprehensibility problem. 

2.4 Association Rules 

Association rules are described as $a \to b$, where $b$ can be regarded as a class. 
Association rules do not work well when “classes” like $b$ in the above rule are 
continuous. When there are many intervals, the rules are too complicated to be 
comprehensible, whereas when there are few intervals, the rules cannot provide accurate 
predictions; therefore some concessions are needed [14]. Some techniques such 
as fuzzy techniques can be applied to the above problem [6], but while fuzzy 
techniques can improve predictions, they degrade the comprehensibility. 

Fig. 1. Decision trees with a continuous class 



The above table shows the summary. 

A in neural network means that training results are incomprehensible. 

A in linear regression means that training results are incomprehensible. 

A in regression tree means that training results are incomprehensible. 

A means that decision trees cannot work well in continuous classes. 

Association rules are omitted in the above table, because association rules do 
not have classes. However the evaluations for association rules are the same as 
those for decision trees. 

Thus, we conclude that there is no technique which can satisfy the two re- 
quirements for KDD techniques, that is, the technique which can provide accu- 
rate predictions and comprehensible rules in the discrete domain and the con- 
tinuous domain. 

2.5 The Solution for the Problem 

The solution for the above problem is extracting comprehensible rules from 
prediction models such as neural networks, multiple regression formulas, and so 
on. When rules are extracted from prediction models, the rules are not used for 
predictions but only for human comprehension. The prediction models 
are used to make the predictions. The combination of a prediction model and the 
rule (or rules) extracted from it is the perfect KDD technique. 

How a rule extracted from a neural network is used is briefly explained. When 
a neural network predicts, the neural network outputs only a class or a figure. 
Humans cannot understand how or why the neural network outputs the class or 
the figure. A rule extracted from the neural network explains how or why the 
network outputs the class or the figure. 

For example, let a network be trained using process data consisting of four 
attributes, that is, a temperature (t), a pressure (p), a humidity (h), and a class, 
the quality of a material (q) . Let the rule extracted be as follows: 

$(200 < t < 300) \lor (p < 2.5) \lor (70 < h < 90) \to q < 0.2$. 

Assume that the network is used for a prediction, and the inputs for the 
network are $t = 310$, $p = 2.0$, and $h = 60$ and the output from the network is 0.1, 
which means low quality. The above rule shows that the network outputs 0.1, 
because the pressure is 2.0, which is below 2.5, that is, $p < 2.5$ holds. Without 
the rule, we cannot know how or why the network outputs 0.1 indicating the low 
quality. Note that the rule extracted from the neural network is not used for the 
predictions, which are made by the neural network. 
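
The explanatory use of the rule can be sketched in a few lines. The code below only reports which conditions of the extracted rule hold for a given input; it does not predict anything, and the names `RULE` and `explain` are illustrative rather than part of KINO.

```python
# The three disjuncts of the extracted rule; if any holds, the rule
# says the quality q is below 0.2.
RULE = [
    ("200 < t < 300", lambda x: 200 < x["t"] < 300),
    ("p < 2.5", lambda x: x["p"] < 2.5),
    ("70 < h < 90", lambda x: 70 < x["h"] < 90),
]


def explain(inputs):
    """Return the labels of the rule conditions satisfied by the inputs."""
    return [label for label, holds in RULE if holds(inputs)]


fired = explain({"t": 310, "p": 2.0, "h": 60})
print(fired)
```

For the inputs of the example only the pressure condition fires, which matches the explanation given in the text.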

The rule extraction technique has been implemented in a KDD tool KINO 
(Knowledge INference by Observation). For details, see [15]. 

3 The Survey of Rule Extraction from Neural Networks 

This section briefly surveys algorithms for rule extraction from neural networks. 
Some rule extraction algorithms are based on the structurization of neural net- 
works [5], but the rule extraction algorithms cannot be applied to neural net- 
works trained by another training method such as the back-propagation method. 
It is desired that rules can be extracted from any neural network trained by any 
training method. The rule extraction algorithms based on the structurization of 
neural networks are out of scope. 

There are several algorithms for rule extraction from neural networks [1], [2]. 
The algorithms can be divided into decompositional algorithms and pedagogical 
algorithms. Decompositional algorithms extract rules from each unit in a neural 
network and aggregate them into a rule. For example, [3] is a decompositional 
algorithm. Pedagogical algorithms generate samples from a neural network and 
induce a rule from the samples. For example, [4] is a pedagogical algorithm. 
Decompositional algorithms can present training results of each unit in neural 
networks, so we can understand the training results by the unit, while pedagog- 
ical algorithms can present only the results of neural networks, so we cannot 
understand the training results by the unit. Therefore, decompositional algo- 
rithms are better than pedagogical algorithms in terms of understandability of 
the inner structures of neural networks. 

Rule extraction algorithms are compared in several items such as network 
structures, training methods, computational complexity, and values. 

network structures: This means the types of network structures the algorithm 
can be applied to. Several algorithms can be applied only to particular 
network structures. Most algorithms are applied to three-layer feedforward 
networks, while a few algorithms can be applied only to recurrent neural 
networks, where DFAs (Deterministic Finite-state Automata) are extracted [10]. 

training methods: This means the training methods the rule extraction algorithm 
can be applied to. Several rule extraction algorithms depend on training 
methods, that is, the rule extraction algorithms can extract rules only 
from the neural networks trained by a particular training method [13], [3]. 
The pedagogical algorithms basically do not depend on training methods. 

computational complexity: Most algorithms are exponential in computational 
complexity. For example, in pedagogical algorithms, the total number 
of samples generated from a neural network is $2^n$, where $n$ is the number of 
inputs to the neural network. It is very difficult to generate many samples 
and induce a rule from many samples. Therefore, it is necessary to reduce 
the computational complexity to a polynomial. Most decompositional algorithms 
are also exponential in computational complexity. 

values: Most algorithms can be applied only to discrete values and cannot be 
applied to continuous values. 

The ideal algorithm can be applied to any neural network, can be applied 
to any training method, is polynomial in computational complexity, and can be 
applied to continuous values. So far, there are a few algorithms which satisfy the 
first and second items [4] . There has been no algorithm which satisfies all of the 
items above. 



4 The Rule Extraction Techniques of the Author 

4.1 The Outline 

We have developed rule extraction techniques for prediction models. We pre- 
sented an algorithm for extracting rules from linear formulas and extended it 
for the continuous domain [16]. Afterwards, we presented efficient algorithms for 
extracting rules from linear formulas [18] and applied them to discover rules from 
numerical data [9]. Currently we are applying the algorithms to nonparametric 
regression formulas to discover rules from images [21]. 

We extended the algorithms for linear formulas to the algorithms for neural 
networks [19], and modified them to improve the accuracies and simplicities [20]. 

The rule extraction algorithm from a linear formula is almost the same as the 
rule extraction algorithm from a unit in a neural network. Therefore, hereinafter 
in this section, we focus on neural networks. For the rule extraction algorithms 
from neural networks, the algorithms basically satisfy the four items listed in the 
preceding section, that is, the algorithms can be applied to any neural network 
trained by any training method, are polynomial in computational complexity, 
and can be applied to continuous values. However, the algorithms have a con- 
straint, that is, they can be applied only to neural networks whose units’ output 
functions are monotone increasing. The algorithms are decompositional algo- 
rithms. 

There are two kinds of domains, that is, discrete domains and continuous 
domains. The continuous domain will be discussed later. The discrete domains 
can be reduced to {0, 1} domains by dummy variables. So only {0, 1} domains 
have to be discussed. In the {0, 1} domain, the units can be approximated to 
the nearest Boolean functions, which is the basic idea for the rule extraction 
algorithm. The approximation theory is based on multilinear function space. 
The space, which is an extension of Boolean algebra of Boolean functions, can be 
made into a Euclidean space and includes linear functions and neural networks. 
Due to space limitations, the details can be found in [22]. 

Fig. 2. Approximation 



4.2 The Outline of the Rule Extraction Algorithms 



The basic algorithm is that the units of neural networks are approximated by 
Boolean functions. The output functions are sigmoid functions, so the values of 
the units lie in [0,1]. Let $(f_i)$ $(i = 1, \ldots, 2^n)$ be the values of a unit of a neural 
network. Let $(g_i)$ be the values of Boolean functions, that is, $g_i = 0$ or 1. The 
basic algorithm is as follows: 

$$g_i = \begin{cases} 1 & (f_i \geq 0.5), \\ 0 & (f_i < 0.5). \end{cases}$$

This algorithm minimizes Euclidean distance. 

Fig. 2 shows a case of two variables. Crosses stand for the values of a unit of 
a neural network and circles stand for the values of a Boolean function. 00, 01, 10 
and 11 stand for the domains, for example, 00 stands for $x = 0$, $y = 0$. In this 
case, there are four domains as follows: 

$(0, 0)$, $(0, 1)$, $(1, 0)$, $(1, 1)$ 

The values of the Boolean function $g(x, y)$ are as follows: 

$g(0, 0) = 1$, $g(0, 1) = 1$, $g(1, 0) = 0$, $g(1, 1) = 0$. 

The Boolean functions corresponding to the points are as follows: 

$(0, 0) \Leftrightarrow \bar{x}\bar{y}$, $(0, 1) \Leftrightarrow \bar{x}y$, $(1, 0) \Leftrightarrow x\bar{y}$, $(1, 1) \Leftrightarrow xy$. 

A Boolean function $g(x, y)$ of two variables is represented as follows: 

$$g(x, y) = g(0, 0)\bar{x}\bar{y} \lor g(0, 1)\bar{x}y \lor g(1, 0)x\bar{y} \lor g(1, 1)xy.$$

Therefore, in the case of Fig. 2, the Boolean function is as follows: 

$$g(x, y) = g(0, 0)\bar{x}\bar{y} \lor g(0, 1)\bar{x}y \lor g(1, 0)x\bar{y} \lor g(1, 1)xy = 1\bar{x}\bar{y} \lor 1\bar{x}y \lor 0x\bar{y} \lor 0xy = \bar{x}\bar{y} \lor \bar{x}y = \bar{x}.$$
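
The basic (exponential) algorithm can be sketched as follows. The weights and bias of the unit are hypothetical values chosen so that the result matches the situation in Fig. 2; the helper names are likewise illustrative and not taken from the paper.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unit(x, y, weights=(-4.0, 0.5), bias=2.0):
    """A single sigmoid unit with hypothetical weights and bias."""
    return sigmoid(weights[0] * x + weights[1] * y + bias)


def approximate(f, n):
    """Truth table of the nearest Boolean function: on each of the 2**n
    points of {0, 1}**n, g is 1 if f >= 0.5 and 0 otherwise."""
    return {bits: int(f(*bits) >= 0.5) for bits in product((0, 1), repeat=n)}


def to_dnf(table, names):
    """Join the minterms on which g is 1 by disjunction."""
    terms = [
        " & ".join(name if bit else f"~{name}" for name, bit in zip(names, bits))
        for bits, value in table.items() if value
    ]
    return " | ".join(f"({term})" for term in terms) or "0"


table = approximate(unit, 2)
dnf = to_dnf(table, ("x", "y"))
print(table)
print(dnf)
```

The enumeration visits all $2^n$ points, which is why the text calls for polynomial algorithms; the unsimplified DNF produced here reduces to $\bar{x}$, as in the worked example.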

The basic algorithm is exponential in computational complexity, therefore, 
polynomial algorithms are needed. The author has presented polynomial algo- 
rithms. The details are omitted due to space limitations. The outline of the 
polynomial algorithms follows. 



1. Check if a term exists in the Boolean function after the approximation. 

2. Connect the terms existing after the approximation by logical disjunction 
to make a DNF formula. 

3. Execute the above procedures up to a certain (usually two or three) order. 

After rules are extracted from all units, the rules are aggregated into a rule for 
the network. 

If accurate rules are obtained, the rules are complicated, and if simple rules 
are obtained, the rules are not accurate. It is difficult to obtain rules which are 
simple and accurate at the same time. We have presented a few techniques to 
obtain simple and accurate rules. For example, attribute selection works well for 
obtaining simple rules [20]. 

4.3 The Continuous Domain 

When classes are continuous, rule extraction from neural networks is important. 
However, in the continuous domain, few algorithms have been proposed. For 
example, algorithms for extracting fuzzy rules have been presented [8] , but fuzzy 
rules are described by linear functions, and so fuzzy rules are incomprehensible. 

Continuous domains can be normalized to [0,1] domains by some normalization 
method. The author presented a qualitative expression system corresponding 
to Boolean functions in the [0,1] domain. The system consists of direct 
proportion ($y = x$), reverse proportion ($y = 1 - x$), conjunction and disjunction. 
The reverse proportion ($y = 1 - x$) is a little different from the conventional 
one ($y = -x$), because $y = 1 - x$ is the natural extension of the negation in 
Boolean functions. The conjunction and disjunction can also be obtained by a 
natural extension. The functions generated by direct proportion, reverse 
proportion, conjunction and disjunction are called continuous Boolean functions, 
because they satisfy the axioms of Boolean algebra [17]. 

In the [0,1] domain, the units in a neural network can be approximated to 
the nearest continuous Boolean functions. The basic algorithm and polynomial 
algorithms are almost the same as in the discrete domain. For example, a rule 
is described as follows. 

$(t \land p) \lor \bar{h} \to q$, 

where t stands for temperature, p stands for a pressure, h stands for humidity, 
and q stands for the quality of a material. The above rule means 
“(temperature increases and pressure increases) or humidity decreases, then 
quality increases.” or 

“(temperature decreases and pressure decreases) or humidity increases, then 
quality decreases.” 



5 Open Problems 

1. When classes are continuous, other techniques such as decision trees do not 
work well, as explained in Section 2. Therefore, rule extraction from 
prediction models with continuous classes is important. Continuous Boolean 
functions work well for linear formulas, but cannot express the detailed 
information on neural networks, because neural networks are nonlinear. Since 
continuous Boolean functions are insufficient, extracting rules 
described by intervals such as 

$(200 < t < 300) \lor (p < 2.5) \lor (70 < h < 90) \to q < 0.2$ 

from neural networks is needed. 

2. Usual real data consist of discrete data and continuous data. Therefore, rule 
extraction from a mixture of discrete attributes and continuous attributes is 
needed. 

3. Training domains are much smaller than prediction domains. For example, in 
the case of voting-records, which consist of 16 binary attributes, the number 
of possible training data is $2^{16}$ (= 65536), while the number of training data 
is about 500. The outputs of a neural network are almost 100% accurate for 
the about 500 training data, and are predicted values for the other approximately 
65000 data. These predicted values are probabilistic, because the parameters 
of the neural network are initialized probabilistically. Since 65000 is much 
greater than 500, the probabilistic part is much larger than the 
non-probabilistic part. Therefore, when a rule is extracted from a neural 
network, the predicted values of the neural network have to be dealt with 
probabilistically. 

We have developed two types of algorithms. The first one deals with the 
whole domain equally [19]. The second one deals with only the training 
domain and basically ignores the prediction domain [20]. Both algorithms 
can be regarded as opposite extremes. We also have developed an algorithm 
which deals with the prediction domains probabilistically [9]. Future work 
includes the development of algorithms dealing with the prediction domains 
appropriately. 

4. There are several prediction models based on linear formulas. It is desired 
that rule extraction techniques from linear formulas be applied to nonpara- 
metric regression analysis[21], (multivariate) autoregression analysis, regres- 
sion trees, differential equations (difference equations), and so on. 

6 Conclusions 

This paper has explained that rule extraction from prediction models is needed 
for perfect KDD techniques. This paper has briefly reviewed the rule extraction 
techniques for neural networks and has briefly explained the rule extraction 
algorithms developed by the author. There are a lot of open problems in the 
field of rule extraction from prediction models, therefore the author hopes that 
researchers will join this field. 

Abstract. In practical applications, some properties are represented by 
a pair of related attributes, for example, blood pressure, temperature 
changes, etc. The existing data mining approaches for association rules 
cannot tackle those cases, because they treat every attribute 
independently. In this paper, as a special kind of correlation, we express the pair 
of attributes as a range-type attribute. We define a set of fuzzified 
relations between ranges and revise the definition of association rules. We 
also propose effective algorithms to evaluate the measures for ranking 
association rules on related numeric attributes. 

keywords: Data mining, Association rules, Numeric attributes, Fuzzified 
relations 



1 Introduction 

In practical applications, there may exist correlation among attributes. In this 
paper, we pay attention to a special kind of correlation, that is, a property 
represented by a pair of related attributes, for example, blood pressure, 
temperature changes, time interval, etc. In this case, it is natural to treat a pair of 
values as a range. So an association rule on a range-type attribute is expected. 

Let us see an example. Suppose there is a collection of experimental data that 
records the occurrence of an event, say $Y$, and the temperature changes, say $X$, in 
each fixed time interval. The temperature is a continuous variable, which should 
be discretized in some way. Assume we record only the highest and lowest 
temperature in each interval, denoted as $X_1$ and $X_2$, respectively. Assume we also know 
from practical experience that $Y$ is related not only to the difference 
of temperature (the distance between $X_1$ and $X_2$) but also to the highest and lowest 
temperature (the values of $X_1$ and $X_2$). In this case, it is better to treat $[X_1, X_2]$ 
as a range than as three independent factors. Hence, an association rule like 
$[X_1, X_2] \subseteq_f R \Rightarrow (Y = \text{yes})$ is expected, where $\subseteq_f$ is a binary relation between 
two ranges, whose membership function takes a value in [0,1]. If $X$ is a true subset of 
$R$, the membership of $(X, R) \in \subseteq_f$ is 1. If $X$ does not overlap with $R$, the 
membership is 0. Otherwise, it is a value between 0 and 1 determined by the percentage of 
$X$ covered by $R$. We call this relation the fuzzified subset relation in this paper. The 
existing data mining approaches on numeric or interval data [illJIbj can not tackle 
this case appropriately, because they treat each attribute independently. By their 
approaches, an association rule like $X_1 \in [a_1, b_1] \land X_2 \in [a_2, b_2] \Rightarrow (Y = \text{Yes})$ 
or $(X_1, X_2) \in P \Rightarrow (Y = \text{Yes})$, where $P$ is a rectangle, can be extracted. In this 
rule, $X_1$ and $X_2$ are treated as two independent attributes. 

In this paper, we focus on mining the association rules on range-type at- 
tribute. For understanding the contribution of this study, we review briefly the 
work in the discovery of association rules. Given a huge database, an association 
rule is a relationship between two conditions $C_1$ and $C_2$ such that if a tuple meets 
the condition $C_1$, then it also satisfies the other condition $C_2$ with a 
probability (called confidence), denoted by $C_1 \Rightarrow C_2$. Agrawal et al. m considered the 
Boolean-type data in their study of finding association rule from a huge database 
of sales transactions. An unordered attribute (also called categorical attribute) 
can be transformed into this case by treating each value v of the attribute A as 
a new attribute A.v. In this case, the membership of specified data to a given 
set is necessary in defining the measures of the patterns of interest. Srikant and 
Agrawal [Zj studied a more complex case that the attribute may be a quantita- 
tive one. A quantitative attribute is an ordered discrete one, but the separation 
between two neighbor values has no meaning. In this case, whether a specified 
datum is in a given range need to be considered. Minker et. al. |E| and Fukuda 
et. al. P] further considered more general case that the data are interval, that 
is, ordered data for which the separation between data points has meaning. In 
this case, the distance between data should be considered in addition. 

From the above brief review, we see that the problem proposed in this paper 
differs from the existing ones. A pair of related numeric attributes can be 
viewed as a range. Hence this paper discusses mining association rules on range- 
type data. The concepts of association rules as well as measures of support 
and confidence should be revised. Especially, the overlapped part between two 
ranges should be considered in the definition of measures. Moreover, an efficient 
algorithm to extract optimized association rules is proposed. 

The remainder of this paper is organized as follows. In section 2, we define
three fuzzified relations on range data. In section 3, we revise the definitions
of support and confidence for an association rule, and the concept of optimized
association rules. We also propose a naive algorithm with time complexity $O(n^2)$
to extract the optimized rules, where $n$ is the number of buckets. In section 4,
we propose a set of iterative formulas for evaluating the measures. Their time
complexity is $O(n)$, which is much better than that of the naive algorithm. In
section 5, a set of experiments is designed to verify the efficiency of our
algorithm.



2 Fuzzified Relations on Ranges 

The operations between ranges, such as $\cup$ (union), $\cap$ (intersection), and
$-$ (difference), are defined as usual. For example, $[5, 10] \cap [2, 8] = [5, 8]$,
$[5, 10] \cup [2, 8] = [2, 10]$, $[5, 10] - [2, 8] = [8, 10]$, and
$[2, 8] - [5, 10] = [2, 5]$. However, the relations between ranges such as
$\subseteq$, $\supseteq$ and $=$ should be reconsidered, as we showed in the
Introduction. Similar to the approach of fuzzy set theory, we define a fuzzified
relation between ranges by a membership function whose values lie in $[0,1]$.

Definition 1. Assume there are two comparable (closed) ranges $S = [s_1, s_2]$
and $T = [t_1, t_2]$. Let $\|\cdot\|$ denote the length of a range. A fuzzified
subset relation between $S$ and $T$, denoted by $(S, T) \in\, \subseteq_f$ or
$S \subseteq_f T$, is defined by the following membership function:

$$d_{\subseteq_f}(S,T) = \frac{\|S \cap T\|}{\|S\|}$$

Similarly, a fuzzified superset relation and a fuzzified equivalence relation
between $S$ and $T$, denoted by $S \supseteq_f T$ and $S =_f T$, are defined by
the following membership functions, respectively:

$$d_{\supseteq_f}(S,T) = \frac{\|S \cap T\|}{\|T\|} \quad\text{and}\quad d_{=_f}(S,T) = \frac{\|S \cap T\|}{\|S \cup T\|}.$$

Example 1. Consider the following two ranges $S = [5, 20]$ and $T = [10, 15]$.

$d_{\subseteq_f}(S,T) = \|[5,20] \cap [10,15]\| / \|[5,20]\| = 1/3$
$d_{\supseteq_f}(S,T) = \|[5,20] \cap [10,15]\| / \|[10,15]\| = 1$
$d_{=_f}(S,T) = \|[5,20] \cap [10,15]\| / \|[5,20] \cup [10,15]\| = 1/3$

As we mentioned in the Introduction, which fuzzified relation is used depends on
the application. In the following discussion, a fuzzified relation is denoted as
$\sim_f$ if there is no need to distinguish among them.
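
The three membership functions of Definition 1 can be sketched in a few lines of Python. This is only an illustration: the function names and the tuple representation `(lo, hi)` of a range are assumptions made here, and the ranges are assumed to have positive length. Exact fractions are used so that the values of Example 1 come out without rounding.

```python
from fractions import Fraction

def length(r):
    """Length of a closed range (lo, hi); an empty range has length 0."""
    lo, hi = r
    return max(hi - lo, 0)

def intersect(s, t):
    """Intersection of two ranges (may be empty, i.e. lo > hi)."""
    return (max(s[0], t[0]), min(s[1], t[1]))

def hull(s, t):
    """Union of two overlapping ranges, e.g. [5,10] U [2,8] = [2,10]."""
    return (min(s[0], t[0]), max(s[1], t[1]))

def d_subset(s, t):
    # ||S n T|| / ||S||
    return Fraction(length(intersect(s, t)), length(s))

def d_superset(s, t):
    # ||S n T|| / ||T||
    return Fraction(length(intersect(s, t)), length(t))

def d_equal(s, t):
    # ||S n T|| / ||S u T||
    return Fraction(length(intersect(s, t)), length(hull(s, t)))

S, T = (5, 20), (10, 15)
print(d_subset(S, T), d_superset(S, T), d_equal(S, T))  # 1/3 1 1/3
```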



3 Optimized Association Rules 

3.1 Basic Definitions 

In this paper, we are interested in a specific form of association rules on a
range-type attribute, $(X \sim_f [a, b]) \Rightarrow (Y = \text{Yes})$, where
$\sim_f$ is one of the fuzzified relations defined in Definition 1.

Association rules are typically ranked by a set of measures. Common measures
are support and confidence |ll2| . The support of a rule is the frequency of the
occurrence of the rule (or the premise of the rule) in the database; the
confidence is the strength of the rule implication. For example, for a database
of sales transactions, one usually measures a buying pattern $A \Rightarrow B$
(meaning that users who buy item A almost always buy item B) by the two
measures |lti| . The support measure quantifies how often users buy both items
A and B (or only A), as a fraction of the total number of users. The confidence
measure is the fraction of the users who buy item A that also buy item B.

The measures for patterns of interest for categorical attributes P, quantitative
attributes (ordered discrete attributes) |Z] or continuous attributes [h|3j have
been defined. However, these definitions cannot be applied directly to
range-type attributes, because the membership functions of range-type attributes
are not Boolean. Here, we define a set of measures for range-type attributes.

Assume there is a database $D = \{t_1, \cdots, t_n\}$, which contains a range-type
attribute $X$ and a Boolean attribute $Y$; $t_i.X\,(= [l_i, u_i])$ is the value
of $X$ in the tuple $t_i$. Let $R = [L, U]$ be a given range and $C$ a Boolean
condition on $Y$. We denote $D_C = \{t_i \mid t_i \in D \wedge C\}$ and
$D_R = \{t_i \mid t_i \in D_C \wedge t_i.X \cap R \neq \emptyset\}$.

Definition 2. The support of a rule $X \sim_f R \Rightarrow C$ is defined as below:

$$\alpha(R) = |D_R| / |D| \qquad (1)$$

where $|\cdot|$ represents the size of a set.

The support of a rule is the ratio of the number of tuples that satisfy the
given condition $C$ and intersect the given range $R$ in the attribute $X$ to
the total number of tuples in the database. Since the condition $C$ is fixed in
this study, the range $R$ is the only parameter in this definition.

However, the measure $\alpha(R)$ considers only whether a range intersects the
given range $R$. The bigger $R$ is, the bigger $\alpha(R)$ is. Hence $\alpha(R)$
alone is not enough to determine an appropriate range $R$ for the association
rule. We also have to consider what percentage of $R$ is covered on average.

Definition 3. The intensity of a rule $X \sim_f R \Rightarrow C$ is defined as below,

$$\beta(R) = \frac{\sum_{t_i \in D_R} d_{\sim_f}(R, t_i.X)}{|D_R|} \qquad (2)$$

where $d_{\sim_f}(R, t_i.X)$ is one of the membership functions defined in
Definition 1.

The measure $\beta(R)$ quantifies the percentage of the given range $R$ that is
covered, on average, by the tuples in the database. The smaller $R$ is, the
bigger $\beta(R)$ is.

Example 2. Assume there is a database $D(X,Y) = \{([0,8],1), ([0,5],1), ([5,10],1),
([8,10],0), ([7,15],1)\}$. Consider two rules with ranges $R_1 = [5,10]$ and
$R_2 = [0,10]$, respectively. The values of $d_{\supseteq_f}(R_1, X)$ are 0.6, 0,
1, 0.4 and 0.6, respectively. The values of $d_{\supseteq_f}(R_2, X)$ are 0.8,
0.5, 0.5, 0.2 and 0.3, respectively. So, $\beta(R_1) = (0.6 + 1 + 0.6)/3 = 0.733$,
and $\beta(R_2) = (0.8 + 0.5 + 0.5 + 0.3)/4 = 0.525$. Hence, the intensity of the
rule with $R_1$ is larger than that with $R_2$.



Definition 4. The confidence of a rule $(X \sim_f R) \Rightarrow C$ is defined as below.

$$\gamma(R) = \frac{\sum_{t_i \in D_C} d_{\sim_f}(R, t_i.X)}{\sum_{t_i \in D} d_{\sim_f}(R, t_i.X)} \qquad (3)$$

$\gamma(R)$ reflects the ratio of the weighted number of ranges in $D_C$ to that in $D$.

Example 3. Consider the database and the ranges $R_1$ and $R_2$ in Example 2. We have
$\sum_{t_i \in D_C} d_{\supseteq_f}(R_1, t_i.X) = 2.2$;
$\sum_{t_i \in D} d_{\supseteq_f}(R_1, t_i.X) = 2.6$;
$\sum_{t_i \in D_C} d_{\supseteq_f}(R_2, t_i.X) = 2.1$ and
$\sum_{t_i \in D} d_{\supseteq_f}(R_2, t_i.X) = 2.3$.
Then, $\gamma(R_1) = 2.2/2.6 = 0.85$, and $\gamma(R_2) = 2.1/2.3 = 0.91$. Hence,
the confidence of the rule $X \supseteq_f R_2 \Rightarrow C$ is larger than
that of the rule $X \supseteq_f R_1 \Rightarrow C$.
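
The three measures can be checked numerically on the database of Examples 2 and 3. The sketch below is illustrative only. It assumes that the membership value used in the examples is the fraction of the rule range covered by a tuple's range, that the condition C is `y == 1`, and that a tuple counts as intersecting the rule range only when the overlap has positive length (which is how the examples treat the tuple [0,5] against [5,10]). The function names are hypothetical.

```python
def coverage(r, x):
    """Fraction of range r covered by range x: ||r n x|| / ||r||."""
    overlap = min(r[1], x[1]) - max(r[0], x[0])
    return max(overlap, 0) / (r[1] - r[0])

def measures(db, r):
    """Return (support, intensity, confidence) for the rule range r.

    db is a list of (range, y) pairs; the condition C is taken to be y == 1.
    Assumes at least one tuple satisfying C overlaps r.
    """
    d_all = [coverage(r, x) for x, _ in db]
    d_c = [coverage(r, x) for x, y in db if y == 1]
    hits = [d for d in d_c if d > 0]      # tuples that satisfy C and overlap r
    support = len(hits) / len(db)         # formula (1)
    intensity = sum(hits) / len(hits)     # formula (2)
    confidence = sum(d_c) / sum(d_all)    # formula (3)
    return support, intensity, confidence

db = [((0, 8), 1), ((0, 5), 1), ((5, 10), 1), ((8, 10), 0), ((7, 15), 1)]
for r in [(5, 10), (0, 10)]:
    s, b, g = measures(db, r)
    print(r, round(s, 3), round(b, 3), round(g, 3))
```

Rounded to three places, the intensity and confidence printed for the two ranges agree with the values 0.733, 0.525, 0.85 and 0.91 given in the examples.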



Definition 5. A rule $X \sim_f R \Rightarrow C$ is called interested if its support
$\alpha(R)$, intensity $\beta(R)$ and confidence $\gamma(R)$ are not less than the
given minimum thresholds, respectively. Among the interested rules, an optimized
rule is one that maximizes one of the measures.

For a given range, we can compute the three measures by scanning the database
once. However, when computing the optimized rules, we may need to check
infinitely many candidate ranges and to scan the whole database more than once.
Neither is acceptable for a huge database in practice. Fortunately, an
approximate solution is enough for practical applications. In the remainder of
this section, we propose a naive algorithm to mine the optimized association
rules based on bucketing techniques.



3.2 Bucketing 

Bucketing is widely used for processing a continuous attribute. Existing
algorithms use two kinds of buckets: equi-size and equi-volume. The former
requires that all buckets have the same size, while the number of data in each
bucket may differ. The latter requires that all buckets contain the same number
of data, while the sizes of the buckets may differ. Our algorithm has no special
requirement for buckets. The user can give a sequence of intervals that divides
the whole domain of the attribute, according to the requirements of the
application. We then construct a set of buckets from the intervals.

Let MAX and MIN be the maximum and minimum bounds of the continuous attribute
$X$. Assume the range [MIN, MAX] is divided into a set of intervals
$I_1, I_2, \cdots, I_m$, where $I_i = [x_{i-1}, x_i]$. From the interval set
$\{I_1, \cdots, I_m\}$, we construct a set of buckets
$B_{s,t} = \bigcup_{s \le k \le t} I_k$, where $1 \le s \le t \le m$. We
call $B = \{B_{1,1}, B_{1,2}, \cdots, B_{m,m}\}$ a bucket set. Let the number of
buckets be $n$.

Clearly, $n = m(m+1)/2$. The bucket set can be organized as a triangle (see
Figure 1). The point with label $ij$ represents the bucket $B_{i,j}$. A range
$[l, u]$ is distributed to bucket $B_{s,t}$ if and only if
$l_{s-1} < l \le l_s$ and $u_{t-1} < u \le u_t$, where
$B_{s+1,t-1} = [l_{s+1}, u_{t-1}]$ and $B_{s,t} = [l_s, u_t]$. Ranges in the
same bucket are treated as the same. This procedure is called normalization of
the data. The length of every range in a bucket is taken to be the length of the
bucket. Each tuple in the database belongs to one and only one bucket. Let
$u_{s,t}$ denote the number of tuples in bucket $B_{s,t}$, and $v_{s,t}$ denote
the number of tuples in bucket $B_{s,t}$ that satisfy the condition $C$.
Moreover, let $I_{s,t}$ denote the interval obtained by combining the
consecutive intervals $I_s, I_{s+1}, \cdots, I_t$. In this paper, we consider
all $I_{s,t}$ as candidates for the optimized range. Obviously, the interval
$I_{s,t}$ corresponds to the bucket $B_{s,t}$.



3.3 Naive Algorithm 

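Property 6. Let $U_{s,t} = \sum_{t_i \in D} d_{\sim_f}(I_{s,t}, t_i.X)$,
$V_{s,t} = \sum_{t_i \in D_C} d_{\sim_f}(I_{s,t}, t_i.X)$,
$W_{s,t} = |D_{I_{s,t}}|$, and
$P = \{(k,l) \mid 1 \le k \le t,\ \max(s,k) \le l \le m\}$. Then,

$$\begin{cases}
U_{s,t} = \sum_{(k,l) \in P} d_{\sim_f}(I_{s,t}, I_{k,l}) \cdot u_{k,l} \\
V_{s,t} = \sum_{(k,l) \in P} d_{\sim_f}(I_{s,t}, I_{k,l}) \cdot v_{k,l} \\
W_{s,t} = \sum_{(k,l) \in P} v_{k,l}
\end{cases} \qquad (4)$$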




Proof. It follows from the fact that the ranges intersected with $I_{s,t}$ are
exactly those in the buckets in the polygon $P$ (Figure 1), that is, the buckets
in $\{B_{k,l} \mid (k,l) \in P\}$.


Fig. 1. Buckets intersected with $I_{s,t}$



Algorithm 7. (Evaluating the optimized rules)

s1: evaluate $U_{s,t}$, $V_{s,t}$, and $W_{s,t}$ by formula (4);
s2: for ($1 \le s \le t \le m$) do {
        evaluate $\alpha(I_{s,t})$, $\beta(I_{s,t})$ and $\gamma(I_{s,t})$ according to their definitions;
        if ($\alpha(I_{s,t}) \ge \theta_s \wedge \beta(I_{s,t}) \ge \theta_i \wedge \gamma(I_{s,t}) \ge \theta_c$) then
            insert $(I_{s,t}, \alpha(I_{s,t}), \beta(I_{s,t}), \gamma(I_{s,t}))$ into table $T$;
    };
s3: evaluate $I_{s^*,t^*}$ such that $\gamma(I_{s^*,t^*}) = \max_{I_{s,t} \in T} \gamma(I_{s,t})$;

Although the algorithm evaluates only the optimized confidence association rule,
the other optimized rules can be evaluated similarly.
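
A minimal Python sketch of the naive procedure follows. It is not the authors' implementation (which, according to Section 5, was written in C). It assumes that the bucket counts are given as dictionaries `u` and `v` keyed by `(k, l)`, that all unit intervals have equal length so that the length of $I_{s,t}$ is $t - s + 1$, and that the superset relation is used. The names `overlap`, `naive_measures`, `optimized_confidence` and the threshold parameters are hypothetical.

```python
def overlap(s, t, k, l):
    """Number of unit intervals shared by I_{s,t} and I_{k,l}."""
    return max(min(t, l) - max(s, k) + 1, 0)

def d_superset(s, t, k, l):
    """Share of the candidate I_{s,t} that is covered by the bucket I_{k,l}."""
    return overlap(s, t, k, l) / (t - s + 1)

def naive_measures(m, u, v, d):
    """Evaluate U, V, W of formula (4) for every candidate I_{s,t}."""
    U, V, W = {}, {}, {}
    for s in range(1, m + 1):
        for t in range(s, m + 1):
            # Polygon P: all buckets (k, l) that intersect I_{s,t}.
            P = [(k, l) for k in range(1, t + 1)
                 for l in range(max(s, k), m + 1)]
            U[s, t] = sum(d(s, t, k, l) * u.get((k, l), 0) for k, l in P)
            V[s, t] = sum(d(s, t, k, l) * v.get((k, l), 0) for k, l in P)
            W[s, t] = sum(v.get((k, l), 0) for k, l in P)
    return U, V, W

def optimized_confidence(m, u, v, d, th_s, th_i, th_c):
    """Steps s2 and s3: filter by the thresholds, then maximize confidence."""
    total = sum(u.values())
    U, V, W = naive_measures(m, u, v, d)
    table = []
    for (s, t), w in W.items():
        if w == 0 or U[s, t] == 0:
            continue
        alpha, beta, gamma = w / total, V[s, t] / w, V[s, t] / U[s, t]
        if alpha >= th_s and beta >= th_i and gamma >= th_c:
            table.append((gamma, (s, t), alpha, beta))
    return max(table, default=None)

m = 3
u = {(1, 1): 2, (1, 2): 1, (2, 3): 3, (3, 3): 1}   # all tuples per bucket
v = {(1, 1): 1, (2, 3): 3}                          # tuples satisfying C
U, V, W = naive_measures(m, u, v, d_superset)
best = optimized_confidence(m, u, v, d_superset, 0.1, 0.1, 0.1)
print(best)
```

The nested loops make the cost visible: each of the $m(m+1)/2$ candidates sums over up to $m(m+1)/2$ buckets.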

Theorem 8. The complexity of Algorithm 7 is $O(n^2)$, where $n$ is the number of
buckets.

Proof. The most expensive step is obviously step s1, the evaluation of $U_{s,t}$,
$V_{s,t}$ and $W_{s,t}$. We count the number of arithmetical operations in step s1.

For an arbitrary interval $I_{s,t}$, the polygon $P$ in Figure 1 contains
$mt - t(t-1)/2 - s(s-1)/2$ buckets, where $m$ is the number of intervals. From
the evaluation formulas, we know that the number of arithmetical operations is
a constant multiple of the number of buckets in the polygon $P$. Hence, the
complexity of step s1 is
$\sum_{1 \le s \le t \le m} O(mt - t(t-1)/2 - s(s-1)/2) = O(m^4) = O(n^2)$,
where $n$ is the number of buckets.

However, $O(n^2)$ is too expensive for a data mining algorithm. In the next
section, we derive a set of efficient iterative formulas.



4 Efficient Evaluation Algorithms 



The naive algorithm computes the measures for each candidate independently.
In fact, there are close relationships between candidates. We can make use of
them to reduce the complexity of the evaluation.

Since $W_{s,t}$ is independent of the fuzzified relations, we consider its
evaluation first.



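Property 9. Let $W'_{s,t}$ be the number of ranges in the database that are
contained in the interval $I_{s,t}$ and satisfy the condition $C$. That is,
$W'_{s,t} = |\{t_i.X \subseteq I_{s,t} \mid t_i \in D \wedge C\}|$. Then,

$$W'_{s,t} = \begin{cases}
v_{s,s}, & t = s \\
W'_{s,t-1} + W'_{s+1,t} - W'_{s+1,t-1} + v_{s,t}, & t > s
\end{cases}$$

This formula says that $W'_{s,t}$ can be evaluated iteratively.

Proof. Since the buckets are pairwise disjoint, we have

$W'_{s,t} = |\{t_i.X \subseteq I_{s,t} \mid t_i \in D \wedge C\}|$
$= |\{t_i.X \subseteq I_{s,t} \mid t_i \in B_{k,l} \wedge (1 \le k \le l \le m) \wedge C\}|$
$= |\{t_i.X \subseteq I_{s,t} \mid t_i \in B_{k,l} \wedge (s \le k \le l \le t) \wedge C\}|$
$= \sum_{s \le k \le l \le t} |\{t_i.X \subseteq I_{s,t} \mid t_i \in B_{k,l} \wedge C\}|$
$= \sum_{s \le k \le l \le t} v_{k,l}$
$= \sum_{s \le k \le l \le t-1} v_{k,l} + \sum_{s+1 \le k \le l \le t} v_{k,l} - \sum_{s+1 \le k \le l \le t-1} v_{k,l} + v_{s,t}$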



From this property, we can evaluate all $W_{s,t}$ by the following property.

Property 10. For an arbitrary interval $I_{s,t}$, we have

$$W_{s,t} = W'_{1,m} - W'_{1,s-1} - W'_{t+1,m} \qquad (5)$$
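
The two properties can be sketched together as follows. The sketch assumes the recurrence reads $W'_{s,t} = W'_{s,t-1} + W'_{s+1,t} - W'_{s+1,t-1} + v_{s,t}$ for $t > s$ and that the intersect count is $W_{s,t} = W'_{1,m} - W'_{1,s-1} - W'_{t+1,m}$, with $W'$ of an empty interval taken to be 0 (a convention adopted here, not stated in the text). The bucket counts `v` are a dictionary keyed by `(k, l)`; the function names are hypothetical.

```python
def contained_counts(m, v):
    """W'[s, t]: tuples satisfying C whose bucket lies inside I_{s,t}."""
    Wp = {}
    for width in range(m):                      # fill by increasing t - s
        for s in range(1, m - width + 1):
            t = s + width
            if s == t:
                Wp[s, t] = v.get((s, s), 0)
            else:
                Wp[s, t] = (Wp[s, t - 1] + Wp[s + 1, t]
                            - Wp.get((s + 1, t - 1), 0)   # empty when t = s + 1
                            + v.get((s, t), 0))
    return Wp

def intersect_counts(m, v):
    """W[s, t]: everything minus what lies wholly to the left or right."""
    Wp = contained_counts(m, v)
    return {(s, t): Wp[1, m] - Wp.get((1, s - 1), 0) - Wp.get((t + 1, m), 0)
            for s in range(1, m + 1) for t in range(s, m + 1)}

m = 3
v = {(1, 1): 1, (1, 3): 2, (2, 2): 4, (2, 3): 3, (3, 3): 5}
W = intersect_counts(m, v)
print(W[1, 1], W[2, 2], W[1, 3])
```

Each entry costs a constant number of operations, so all $W_{s,t}$ are obtained in time proportional to the number of buckets.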



In the sequel, we consider the iterative formulas for the intensity and
confidence, which depend on the definition of the fuzzified relation.



Property 11. For the fuzzified relation $\supseteq_f$, we have

$$U_{s,t} = \begin{cases}
\sum_{1 \le k \le s} \sum_{s \le l \le m} u_{k,l}, & s = t \\
\sum_{s \le k \le t} U_{k,k} / (t - s + 1), & s \ne t
\end{cases}$$

$$V_{s,t} = \begin{cases}
\sum_{1 \le k \le s} \sum_{s \le l \le m} v_{k,l}, & s = t \\
\sum_{s \le k \le t} V_{k,k} / (t - s + 1), & s \ne t
\end{cases}$$

Proof. We prove the first formula only. The proof for the second one is similar.

1. (case $t = s$) Since the interval $I_{s,s}$ is a unit interval, every range
intersected with $I_{s,s}$ also covers $I_{s,s}$. Hence,
$d_{\supseteq_f}(I_{s,s}, I_{k,l}) = 1$ if the range $I_{k,l}$ is intersected
with $I_{s,s}$. From the formulas in Property 6 we have
$U_{s,s} = \sum_{1 \le k \le s} \sum_{s \le l \le m} u_{k,l}$.

2. (case $s \ne t$) Consider an arbitrary bucket $B_{k,l}$ in the polygon $P$ in
Figure 1. If it covers the interval $I_{s,t}$, it contributes to every $U_{i,i}$
($s \le i \le t$). If it only intersects $I_{s,t}$, it contributes to those
$U_{i,i}$ for which $I_{i,i}$ is contained in both $I_{s,t}$ and $I_{k,l}$.
Hence, each bucket contributes to $\sum_{s \le k \le t} U_{k,k}$ once for every
unit interval it shares with $I_{s,t}$. Dividing by $t - s + 1$, that is, the
length of the interval $I_{s,t}$, gives
$U_{s,t} = \sum_{s \le k \le t} U_{k,k} / (t - s + 1)$.

Evaluating $U_{s,t}$ and $V_{s,t}$ by these formulas is still too expensive. We
further construct a set of iterative formulas for them.



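Property 12. For the fuzzified relation $\supseteq_f$, let
$U'_{s,t} = \sum_{s \le k \le t} U_{k,k}$ and
$V'_{s,t} = \sum_{s \le k \le t} V_{k,k}$. Then,

$$U'_{s,t} = \begin{cases}
\sum_{1 \le j \le m} u_{1,j}, & s = t = 1 \\
U'_{i-1,i-1} - \sum_{1 \le j \le i-1} u_{j,i-1} + \sum_{i \le j \le m} u_{i,j}, & s = t = i > 1 \\
U'_{s,t-1} + U'_{t,t}, & s \ne t
\end{cases}$$

$$V'_{s,t} = \begin{cases}
\sum_{1 \le j \le m} v_{1,j}, & s = t = 1 \\
V'_{i-1,i-1} - \sum_{1 \le j \le i-1} v_{j,i-1} + \sum_{i \le j \le m} v_{i,j}, & s = t = i > 1 \\
V'_{s,t-1} + V'_{t,t}, & s \ne t
\end{cases}$$

Proof. Straightforward from Property 11.

Property 13. For the fuzzified relation $\supseteq_f$, we have

$$U_{s,t} = U'_{s,t} / (t - s + 1), \qquad V_{s,t} = V'_{s,t} / (t - s + 1) \qquad (6)$$

Proof. Straightforward from Property 11 and Property 12.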
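
The iterative evaluation for the superset relation can be sketched as below. The exact recurrence is reconstructed from Property 11 rather than taken verbatim from the text, so it should be read as an assumption: the diagonal value $U_{i,i}$ is obtained from $U_{i-1,i-1}$ by dropping the buckets that end at $i-1$ and adding the buckets that start at $i$, and $U_{s,t}$ is the running sum of diagonal values divided by $t - s + 1$. Unit intervals are assumed to have equal length, and the names are hypothetical.

```python
def diagonal_sums(m, u):
    """U_{i,i} for i = 1..m, each obtained from the previous one."""
    diag = {1: sum(u.get((1, l), 0) for l in range(1, m + 1))}
    for i in range(2, m + 1):
        dropped = sum(u.get((k, i - 1), 0) for k in range(1, i))   # buckets ending at i-1
        gained = sum(u.get((i, l), 0) for l in range(i, m + 1))    # buckets starting at i
        diag[i] = diag[i - 1] - dropped + gained
    return diag

def superset_measure(m, u):
    """U_{s,t} for the superset relation: running diagonal sum / (t - s + 1)."""
    diag = diagonal_sums(m, u)
    U = {}
    for s in range(1, m + 1):
        running = 0
        for t in range(s, m + 1):
            running += diag[t]            # one addition per candidate
            U[s, t] = running / (t - s + 1)
    return U

m = 3
u = {(1, 1): 2, (1, 2): 1, (1, 3): 4, (2, 3): 3, (3, 3): 1}
U = superset_measure(m, u)
print(U[1, 1], U[2, 2], U[1, 3])
```

The same routine applied to the counts $v_{k,l}$ yields $V_{s,t}$. Each off-diagonal entry costs one addition and one division, in line with the operation count used in the complexity argument.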



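We now consider the evaluation of measures for the fuzzified relation $\subseteq_f$.

Property 14. For the fuzzified relation $\subseteq_f$, we have

$$U_{s,t} = \begin{cases}
\sum_{1 \le j \le m} u_{1,j} / j, & s = t = 1 \\
U_{i-1,i-1} - \sum_{1 \le j \le i-1} u_{j,i-1} / (i - j) + \sum_{i \le j \le m} u_{i,j} / (j - i + 1), & s = t = i > 1 \\
U_{s,t-1} + U_{t,t}, & s \ne t
\end{cases} \qquad (7)$$

$$V_{s,t} = \begin{cases}
\sum_{1 \le j \le m} v_{1,j} / j, & s = t = 1 \\
V_{i-1,i-1} - \sum_{1 \le j \le i-1} v_{j,i-1} / (i - j) + \sum_{i \le j \le m} v_{i,j} / (j - i + 1), & s = t = i > 1 \\
V_{s,t-1} + V_{t,t}, & s \ne t
\end{cases} \qquad (8)$$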



Proof. It is easily derived from their definitions and the fact that
$d_{\subseteq_f}(I_{s,s}, I_{k,l}) = 1 / \|I_{k,l}\|$ when $I_{k,l}$ covers $I_{s,s}$.

The algorithm for evaluating the optimized rules is similar to Algorithm 7,
but replaces step s1 with the following one.

s1: evaluate $U_{s,t}$, $V_{s,t}$ and $W_{s,t}$ by the iterative formulas (5) and (6)
    for $\supseteq_f$, or formulas (5), (7) and (8) for $\subseteq_f$;



Theorem 15. The time complexity of the new algorithm is $O(n)$.

Proof. The complexity except step s1 is $O(m^2)$. Now we consider step s1
for the case of $\supseteq_f$. Computing $U'_{1,1}$ takes $m$ arithmetical
operations. Computing each of the other $U'_{i,i}$ ($i = 2, \cdots, m$) takes
$m + 1$ arithmetical operations. Computing $U'_{s,t}$ ($s \ne t$) takes only 2
operations. Hence, the complexity is
$m + (m + 1)(m - 1) + 2 \cdot m(m - 1)/2 = 2m^2 - 1 = O(m^2)$. Hence, the total
complexity is $O(m^2) = O(n)$.

For the third fuzzified relation, $=_f$, unfortunately, we have not yet found
such an iterative formula.

5 Experimental Results 

The proposed algorithms are implemented in C on a Sun Sparc Workstation 5/110
with SunOS 4.1.4. We design an experiment to test the effectiveness of the
algorithm. We plant a pattern in a data set and expect our algorithm to extract
the pattern accurately. Our approach is described in the following.

The base data set contains 10000 randomly generated tuples. The domain of the
continuous attribute is the integers in [0,1000]. We choose an arbitrary tuple,
say ([914,941],1), as the pattern. Then we duplicate the tuple 10, 100, and 1000
times, respectively, into the data set to generate 3 new datasets, DS1, DS2 and
DS3. By comparing the given pattern with the range in the optimized support
rule that the algorithm extracts from each dataset, we see that the extracted
optimized range becomes closer and closer to the given pattern as we increase
the number of copies of that pattern. Table 1 lists the test results.



Table 1: Effectiveness of the algorithm

No. Intervals   1000          500           200           100
DS1             (-845,-818)   (-846,-819)   (-839,-816)   (-694,-671)
DS2             (-26,-2)      (-24,1)       (-24,4)       (-24,-1)
DS3             (0,0)         (0,1)         (-4,-1)       (-4,-1)

where (a, b) in the table means the differences of the left and right bounds of
the two ranges.

In the dataset DS1, there are only 10 tuples of ([914,941],1). The test showed
that the extracted range is very different from the pattern. As we increase the
number of tuples ([914,941],1), the extracted optimized range becomes closer to
the pattern. When we duplicate the pattern 1000 times, the extracted optimized
range is almost the same as the expected one in all bucket sets. This result
shows that the iterative algorithm can effectively find the rule in the dataset.

We also test the efficiency of our algorithm. The test shows that the running
time of the iterative algorithm is much better than that of the naive one. Both
algorithms run on the same datasets with 10000 tuples and the same thresholds.



Table 2: Running Time (in 1/60 seconds)

No. Intervals         100     200      500
Naive Algorithm       1959    30855    1673479
Iterative Algorithm   70      104      353



6 Conclusions and Future Research 

The contribution of this paper is twofold. First, this paper defined three
fuzzified relations between range-type data, and used them to represent
association rules on range attributes. Furthermore, we revised the measures for
ranking this kind of association rule. To our knowledge, this is the first time
that range-type data have been discussed in a data mining application. Second,
we proposed a set of iterative formulas for evaluating the measures of
association rules. They reduce the time complexity from the $O(n^2)$ of the naive
algorithm to $O(n)$, while the space complexity remains $O(n)$, the same as that
of the naive one.

Correlation is an important relationship among attributes. How to represent and
capture correlation in the premise of an association rule in a general form is
worth further investigation.


Abstract. Most algorithms for association rule mining are variants of
the basic Apriori algorithm [2]. One characteristic of these Apriori- 
based algorithms is that candidate itemsets are generated in rounds, 
with the size of the itemsets incremented by one per round. The num- 
ber of database scans required by Apriori-based algorithms thus depends 
on the size of the largest large itemsets. In this paper we devise a more 
general candidate set generation algorithm, LGen, which generates candi- 
date itemsets of multiple sizes during each database scan. We show that, 
given a reasonable set of suggested large itemsets, LGen can significantly 
reduce the number of I/O passes required. In the best cases, only two 
passes are sufficient to discover all the large itemsets irrespective of the 
size of the largest ones. 

Keywords: Data mining, association rules, lattice, Apriori, LGen 



1 Introduction 

Data mining has recently attracted considerable attention from database practi- 
tioners and researchers because of its applicability in many areas, such as decision 
support, market strategy and financial forecasts. Combining techniques from the 
fields of machine learning, statistics and databases, data mining enables us to 
find out useful and invaluable information from huge databases. 

Mining of association rules is a research topic that has received much atten- 
tion among the various data mining problems. Many interesting works have been 
published recently on this problem and its variations [1,2, 4, 5, 7,8, 9]. The retail 
industry provides a classic example application. Typically, a sales database of a 
supermarket keeps, for each transaction, all the items bought in that transac- 
tion and information such as transaction time, customer-id, etc. The association 
rule mining problem is to find out all inference rules such as: “A customer who 
buys item X and item Y is also likely to buy item Z in the same transaction”, 
where X, Y and Z are not known beforehand. Such rules are very useful for 
marketers to develop and to implement customized marketing programs and 
strategies. 



N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 54-64, 1999. 
(c) Springer-Verlag Berlin Heidelberg 1999

The problem of mining association rules was first introduced in [1]. In that
paper, it was shown that the problem could be decomposed into two subproblems:

1. Find out all large itemsets and their support counts. A large itemset is a set 
of items which are contained in a sufficiently large number of transactions, 
with respect to a support threshold minimum support. 

2. From the set of large itemsets found, find out all association rules that have 
a confidence value exceeding a confidence threshold minimum confidence. 

Since the solution to the second subproblem is straightforward [2], major
research efforts have been spent on the first subproblem. Most of the algorithms
devised to find large itemsets are based on the Apriori algorithm [2]. The
Apriori algorithm finds the large itemsets iteratively. In the $i$-th iteration,
Apriori generates the candidate itemsets of size $i$ ¹ from the set of
size-$(i-1)$ large itemsets $L_{i-1}$, and scans the database to find the support
count of each candidate itemset. Apriori terminates when no more candidate sets
can be generated.

The key of the Apriori algorithm is the Apriori_Gen function [2] which 
wisely generates only those candidate itemsets that may be large. However, at 
each database scan, only candidate itemsets of the same size are generated. So, 
the number of database scans required by Apriori-based algorithms depends on 
the size of the largest large itemsets. For example, if a database contains a size-10 
large itemset, then at least 10 passes over the database are required. For large 
databases containing gigabytes of transactions, the I/O cost is dauntingly big. 
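
The fixed-size candidate generation that this paragraph describes can be sketched as a join step followed by a prune step. This is an illustrative variant, not the exact procedure of [2]: it joins any two large itemsets of size k-1 whose union has size k, instead of using a lexicographic prefix join. After pruning, both approaches yield the same candidates, namely every size-k itemset whose size-(k-1) subsets are all large. The function name is chosen here for illustration.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Candidates of size k from the large itemsets of size k - 1."""
    large_prev = {frozenset(s) for s in large_prev}
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1
    candidates = set()
    for a, b in combinations(large_prev, 2):
        union = a | b
        # Join step: two (k-1)-itemsets that differ in exactly one item.
        if len(union) != k:
            continue
        # Prune step: every (k-1)-subset of the candidate must be large.
        if all(union - {x} in large_prev for x in union):
            candidates.add(union)
    return candidates

L2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"}]
C3 = apriori_gen(L2)
print(C3)
```

Here only {a, b, c} survives: {a, b, d} is pruned because {a, d} is not large, and {b, c, d} because {c, d} is not. Since each call produces candidates of one size only, one database scan is needed per itemset size.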

The goal of this paper is to improve the I/O requirement of the Apriori algo- 
rithm. In particular, we generalize Apriori_Gen to a new candidate set genera- 
tion algorithm, LGen, based on lattice theory. The main idea is to relax Apriori’s 
restriction that candidate itemsets generation must start from size one and that 
at each pass, only candidate itemsets of the same size are generated. Instead, 
LGen takes a (partial) set of multiple-sized large itemsets as a hint to gener- 
ate a set of multiple-sized candidate itemsets. This approach allows us to take 
advantage of an educated guess, or a suggestion, of a set of large itemsets. 

In this paper, we present the LGen candidate set generation function and the 
FindLarge algorithm which uses LGen to discover large itemsets in a database. 
We prove their correctness and show that replacing Apriori and Apriori_Gen 
by FindLarge and LGen allows us to significantly reduce the amount of I/O cost 
required for mining association rules. We study the various properties of the 
algorithms and address the following issues: 

— FindLarge does not generate more candidate itemsets than Apriori does.
In general, it generates them earlier and in fewer passes.

— The set of suggested large itemsets for FindLarge can be found in different
ways. Although the suggested set is not mandatory for FindLarge, it
significantly improves the performance of the algorithm.



1 The size of an itemset is the number of items the itemset contains.

1  function FindLarge(SuggestedLarge)
2    Set of large itemsets with associated counters, MaxLargeItemsets := $\emptyset$
3    Iteration := 1
4    CandidateSet := (all size-1 itemsets) $\cup$ ($\bigcup_{s \in SuggestedLarge} \downarrow s$)
5    Scan database and count occurrence frequency of every set in CandidateSet
6    NewLargeItemsets := large itemsets in CandidateSet
7    while (NewLargeItemsets $\ne \emptyset$)
8      Iteration := Iteration + 1
9      MaxLargeItemsets := Max(MaxLargeItemsets $\cup$ NewLargeItemsets)
10     CandidateSet := LGen(MaxLargeItemsets, Iteration)
11     Count occurrence frequency of every set in CandidateSet
12     NewLargeItemsets := large itemsets in CandidateSet
13   end while
14   return all subsets of elements in MaxLargeItemsets

Fig. 1. Finding large itemsets

— The amount of I/O saved by FindLarge over Apriori depends on the ac- 
curacy of the suggested large itemsets. As an extreme case, if the suggested 
itemsets cover all the large itemsets, FindLarge requires only 2 passes. 

— If sampling is used to obtain a set of suggested large itemsets, we observed 
that a small sample is usually sufficient. Sampling plus FindLarge is thus a 
viable option for fast association rule mining. 



2 LGen 



We generalize Apriori_Gen to a new candidate set generation function called 
LGen based on lattice theory. The main idea of LGen is to generate candidate 
itemsets of bigger sizes early using information provided by a set of suggested 
large itemsets. Before we describe our algorithm formally, let us first illustrate 
the idea with an example. 

(Example 1) Consider a database whose large itemsets are {a, b, c, d, e, f},
{d, e, f, g}, {e, f, g, h}, {h, i, j} and all their subsets. Assume that the itemsets
{a, b, d, e, f}, {e, f, g, h}, and {d, g} are suggested large. During the first iteration,
we count the supports of the singletons as well as those of the suggested itemsets.
Assume that the suggested itemsets are verified large in the first iteration. In
the second iteration, since {a, b, d, e, f} is large, we know that its subset {d, e} is
also large. Similarly, we can infer from {e, f, g, h} that {e, g} is also large. Since
{d, g} is also large, we can generate the candidate itemset {d, e, g} and start
counting it. Similarly, the candidate itemset {d, f, g} can also be generated this
way. Therefore, we have generated some size-3 candidate itemsets before we find
out all size-2 large itemsets.

Our algorithm for finding large itemsets, FindLarge, is shown in Figure 1.
The method is similar to Apriori except that:

— it takes a set of suggested itemsets, SuggestedLarge, as input and counts their
supports during the first database scan, and

— it replaces the Apriori_Gen function by the more general LGen function,
which takes the set of maximal large itemsets (MaxLargeItemsets) as input.

The algorithm consists of two stages. The first stage consists of a single
database scan (lines 4-6). Singletons, as well as the suggested large itemsets and
their subsets ($\bigcup_{s \in SuggestedLarge} \downarrow s$, where
$\downarrow s = \{x \mid x \subseteq s\}$), are counted. Any itemset found to be
large at this stage is put into the set of newly found large itemsets
(NewLargeItemsets).

The second stage of FindLarge is iterative. The iteration continues until no
more new large itemsets can be found. At each iteration, FindLarge generates a
set of candidate itemsets based on the large itemsets it has already discovered. As
discussed in [11], we could apply Apriori_Gen on the whole set of large itemsets
already found. However, the drawback is that the set of large itemsets could be
large, and that it would result in the generation of many redundant candidate
itemsets. Instead, FindLarge first canonicalizes the set of large itemsets into a
set of maximal large itemsets (MaxLargeItemsets), and passes the maximal set
to LGen to generate candidate itemsets. The function Max() (line 9) performs the
canonicalization.

We can consider canonicalization as a way of compressing the information
contained in the set of large itemsets. Suppose we know that a set $s$ is large; we
immediately know that all of its subsets are also large. Considering the set of
itemsets with the subset operator as a lattice, we borrow notations from lattice
theory [3][6] and denote the set of all subsets of $s$ by its downset $\downarrow s$.
Defining the set of maximal elements of $L$ as
$\max(L) = \{x \in L \mid \forall y \in L\ [x \subseteq y \Rightarrow y \subseteq x]\}$,
we can represent the set of all large itemsets $L$ by a union of downsets:
$L = \bigcup_{s \in \max(L)} \downarrow s$.

In Example 1, where {a, b, c, d, e, f}, {d, e, f, g}, {e, f, g, h}, {h, i, j}
and all their subsets are large, max(L) = {{a, b, c, d, e, f}, {d, e, f, g}, {e, f, g, h},
{h, i, j}}. Hence, only 4 itemsets are needed to represent L, which contains 86
large itemsets (including the null set).

The set of maximal large itemsets found is then passed to LGen to generate
candidate itemsets (line 10). The crux is how to do the generation based
on the compressed maximals only. We remark that the Apriori algorithm with
Apriori_Gen is in fact displaying a special case of candidate generation with
canonicalization. Recall that in Apriori, at the beginning of the $n$-th
iteration, the set of large itemsets already discovered is
$\bigcup_{k=1}^{n-1} L_k$. The set of elements with size $(n - 1)$ in the
canonicalized set $\max(\bigcup_{k=1}^{n-1} L_k)$ is $L_{n-1}$. Interestingly,
Apriori_Gen generates the candidate set $C_n$ based solely on $L_{n-1}$.

The function LGen is shown in Figure 2. By simple induction, one can show
that at the beginning of the $n$-th iteration of FindLarge, all large itemsets whose
sizes are smaller than $n$ are known. Hence, when LGen is called at the $n$-th
iteration of FindLarge, it only generates candidate itemsets that are of size $n$ or
larger. To generate fixed-size candidate itemsets, LGen calls the helper function
LGFixedSize (Figure 3). Essentially, given a target candidate itemset size $n$,
LGFixedSize examines every pair of maximal large itemsets $i$, $j$ whose
intersection is at least $n - 2$ in size (lines 3-4). It then generates candidate
itemsets of size $n$ by picking $n - 2$ items from the intersection between $i$
and $j$, one item from the set difference $i - j$, and another item from $j - i$.
A candidate itemset so generated is then checked to see if all of its
size-$(n - 1)$ subsets are already known to be large. (That is, if all of them
are subsets of certain maximal large itemsets.) If not, the candidate itemset is
discarded. The candidate itemsets so generated are collected in the set NewCand
as shown in line 5 of Figure 3.

1  function LGen(MaxLargeItemsets, n)
2    CandidateSet := $\emptyset$
3    repeat
4      NewCand := LGFixedSize(MaxLargeItemsets, n)
5      CandidateSet := CandidateSet $\cup$ NewCand
6      n := n + 1
7    until (NewCand = $\emptyset$)
8    return CandidateSet

Fig. 2. Generating candidate itemsets for a certain iteration

1  function LGFixedSize(MaxLargeItemsets, n)
2    CandidateSet := $\emptyset$
3    foreach $i, j \in$ MaxLargeItemsets, $i \ne j$,
4      if ($|i \cap j| \ge n - 2$)
5        NewCand := { $\{a_1, a_2, \ldots, a_n\}$ |
                       $a_1, a_2, \ldots, a_{n-2} \in i \cap j$,
                       $a_{n-1} \in i - j$, $a_n \in j - i$,
                       $\forall k \in [1, n]$, $\exists t \in$ MaxLargeItemsets
                           s.t. $\{a_1, a_2, \ldots, a_n\} - \{a_k\} \subseteq t$,
                       $\nexists u \in$ MaxLargeItemsets s.t.
                           $\{a_1, a_2, \ldots, a_n\} \subseteq u$ }
6        CandidateSet := CandidateSet $\cup$ NewCand
7      end if
8    end foreach
9    return CandidateSet

Fig. 3. Generating candidate itemsets of a fixed size
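
The procedure just described can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code: it additionally skips any candidate that is already a subset of a maximal large itemset (such an itemset is already known to be large), and it visits each unordered pair once, which yields the same candidates as visiting ordered pairs. The names `lg_fixed_size` and `lgen` are chosen here.

```python
from itertools import combinations

def lg_fixed_size(max_large, n):
    """Size-n candidates built from pairs of maximal large itemsets."""
    max_large = list({frozenset(s) for s in max_large})

    def known_large(itemset):
        return any(itemset <= m for m in max_large)

    candidates = set()
    for i, j in combinations(max_large, 2):
        common = i & j
        if len(common) < n - 2:
            continue
        # n - 2 items from the intersection, one from i - j, one from j - i.
        for core in combinations(sorted(common), n - 2):
            for a in i - j:
                for b in j - i:
                    cand = frozenset(core) | {a, b}
                    if known_large(cand):
                        continue              # already known to be large
                    if all(known_large(cand - {x}) for x in cand):
                        candidates.add(cand)
    return candidates

def lgen(max_large, n):
    """Collect candidates of size n, n + 1, ... until a size yields none."""
    candidates = set()
    while True:
        new = lg_fixed_size(max_large, n)
        candidates |= new
        n += 1
        if not new:
            return candidates

# Maximal large itemsets after the first scan of Example 1:
# the three verified suggestions plus the singletons they do not cover.
max_large = ["abdef", "efgh", "dg", "c", "i", "j"]
cands = lgen(max_large, 2)
size3 = {c for c in cands if len(c) == 3}
print(sorted("".join(sorted(c)) for c in size3))   # ['deg', 'dfg']
```

Run on Example 1, the second iteration already produces the size-3 candidates {d, e, g} and {d, f, g} alongside the size-2 candidates, as the example claims.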



2.1 Theorems 

We summarize a few properties of LGen in the following theorems. For the proofs, 
please refer to [11]. We use the symbol S to represent the set of suggested 
large itemsets and [ y to represent the downset of any itemset y (i.e., | y = 
{x \ X C y}). So UsgS i s is the set of all itemsets suggested implicitly or 
explicitly by the suggested set S. Also, we use CLCen to represent the set of all 
candidate itemsets generated by LGen in F indLarge and CApriori to represent the 
set of all candidate itemsets generated by Apriori_Gen in the Apriori algorithm. 

Theorem 1. Given a set of suggested large itemsets $S$,
$C_{Apriori} \subseteq C_{LGen} \cup \left(\bigcup_{s \in S} \downarrow s\right)$.

Since FindLarge (which uses LGen) counts the supports of all itemsets in
$C_{LGen} \cup \left(\bigcup_{s \in S} \downarrow s\right)$, Theorem 1 says that any candidate
itemset that is generated by Apriori will have its support counted by FindLarge.
Hence, if Apriori finds out all large itemsets in the database, FindLarge does too.
In other words, FindLarge is correct.

Theorem 2. $C_{LGen} \subseteq C_{Apriori}$.

Theorem 2 says that the set of candidate itemsets generated by LGen is a subset
of that generated by Apriori_Gen. LGen thus does not generate any unnecessary
candidate itemsets or waste resources in counting bogus ones. However, recall
that FindLarge counts the supports of the suggested large itemsets in the first
database scan for verification. Therefore, if the suggested set contains itemsets
that are actually small, FindLarge will count their supports superfluously.
Fortunately, the number of large itemsets in a database is usually orders of
magnitude smaller than the number of candidate itemsets. The extra support
counting is thus insignificant compared with the support counting of all the
candidate itemsets. FindLarge using LGen thus requires counting effort similar
to that of Apriori.

Theorem 3. If $S = \emptyset$ then $C_{LGen} = C_{Apriori}$.

Theorem 3 says that without suggested large itemsets, FindLarge reduces to 
Apriori. In particular, they generate exactly the same set of candidate itemsets. 

3 Experiments 

To evaluate the performance of FindLarge using LGen as the candidate set gen- 
eration algorithm, we performed extensive simulation experiments. Our goals 
are to study the I/O savings that can be achieved by FindLarge over Apriori, 
and how sampling can be used to obtain a good set of suggested large itemsets. 
In this section, we present some representative results from the experiments. 

3.1 Synthetic Database Generation 

In the experiments, we used the model of [8] to generate synthetic data as the 
test databases. Due to space limitations, readers are referred to [8] and [11] for 
details. 



3.2 Coverages and I/O Savings 

For each database instance generated, we first discovered the set of large
itemsets, $L$, using Apriori. Our first set of experiments studies how the “coverage”
of the suggested large itemsets affects the performance of FindLarge. By coverage,
we mean the fraction of large itemsets in $L$ that are suggested. (If an itemset is
suggested, then all of its subsets are also implicitly suggested.) To model
coverage, we drew a random sample from $L$ to form a set of suggested large
itemsets $S$. We define the coverage of a suggested set $S$ over the set of large
itemsets $L$ by

$$\text{coverage} = \frac{\left|\bigcup_{s \in S \cap L} \downarrow s\right|}{|L|}.$$

Since in our first set of experiments, $S$ was drawn from $L$, we have $S \cap L = S$.
After we had constructed S, we ran FindLarge on the database using S as the 
suggested set. Finally, we compared the number of I/O passes each algorithm 
had taken. Note that with the way we generated the suggested set S, no ele- 
ment in S was small. In practice, however, the suggested itemsets could contain 
small itemsets. Since small suggested itemsets are discarded in the first pass of 
FindLarge (see Figure 1), their presence does not affect the number of I/O passes 
required by FindLarge. So they were not modeled in this set of experiments. 
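As a small illustration of the coverage measure, the following Python sketch computes it for explicit sets of itemsets. The function names are ours, and two assumptions are made: the empty itemset is left out of the downset, and `large` is taken to contain every large itemset (so that subsets of a large itemset are themselves in `large`).

```python
from itertools import combinations


def downset(itemset):
    """All non-empty subsets of an itemset (empty set excluded by assumption)."""
    items = tuple(itemset)
    return {
        frozenset(c)
        for r in range(1, len(items) + 1)
        for c in combinations(items, r)
    }


def coverage(suggested, large):
    """Fraction of large itemsets suggested explicitly or implicitly."""
    large = {frozenset(x) for x in large}
    covered = set()
    for s in suggested:
        s = frozenset(s)
        if s in large:  # only suggested itemsets that are really large count
            covered |= downset(s)
    return len(covered) / len(large)


if __name__ == "__main__":
    # Hypothetical L: every non-empty subset of {1, 2, 3}, plus the singleton {4}.
    L = downset({1, 2, 3}) | {frozenset({4})}
    print(coverage([{1, 2}], L))          # {1}, {2}, {1, 2} out of 8 itemsets
    print(coverage([{1, 2, 3}, {4}], L))  # everything in L is covered
```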




Fig. 4. Number of I/O passes vs. coverage under different support thresholds.

Fig. 5. A representative I/O vs. coverage curve.



We generated a number of database instances according to the model men- 
tioned above. For each database instance, we applied Apriori to find the large 
itemsets under different support thresholds. Also, we generated a number of 
suggested sets of various coverages. We then applied FindLarge to the different 
database instances with different suggested sets under different support
thresholds. The numbers of I/O passes that Apriori and FindLarge had taken were
then compared. Figure 4 shows the result obtained from one typical database
instance. In the following we refer to this particular database as $\mathcal{D}$.

In Figure 4, three sets of points (‘O’, ‘+’, ‘□’) are shown corresponding to
the support thresholds 1%, 0.75%, and 0.5% respectively. Each point shows the
number of I/O passes FindLarge took when applied to $\mathcal{D}$ with a particular
suggested set of a certain coverage. For example, the □ point labeled A shows that
when FindLarge was applied with a suggested set whose coverage was 81.6%, 6
passes were required. Note that the points shown in Figure 4 take on discrete
values. The lines connecting points of the same kind are there for legibility
only and should not be interpreted as interpolation.

For the database $\mathcal{D}$, when the support threshold was set to either 1.0%
or 0.75%, the size of the largest large itemsets was 9. Apriori took 9 I/O passes.
This is shown in Figure 4 by the ‘O’ and ‘+’ points when coverage equals 0.
(Recall that FindLarge reduces to Apriori when the suggested set is null.) When
the support threshold was lowered to 0.5%, the support counts of certain size-10
itemsets also exceeded the support threshold. In that case, the size of the largest
large itemsets was 10. Apriori thus took 10 I/O passes, as shown by the ‘□’
points when coverage equals 0.

One general observation from Figure 4 is that the higher the coverage of
the suggested set with respect to the set of large itemsets, the fewer I/O
passes FindLarge requires. In fact, all of the data points we
obtained from our experiment exhibit a typical curve as shown in Figure 5.

In general, we can divide the curve into four stages: 

At point a (coverage = 0). When the suggested set does not cover any 
large itemsets, FindLarge with LGen degenerates to Apriori. The number of 
I/O passes required by the two algorithms are thus the same. 

Between points a and b. In this region, FindLarge takes the same number
of passes as Apriori does. With a very small coverage, only a few large
itemsets are suggested, and these itemsets usually consist of only a small number
of items.^8 In this case, LGen is unable to provide the advantage of generating
large-sized candidate itemsets early. Hence, no saving in I/O is obtained. The
length of the line ab, fortunately, is usually small. In our experiments, for
example, the line ab in general spans only from coverage = 0% to 20%. As we will
see later, a suggested set with a larger-than-20% coverage is easily obtainable
by sampling techniques.

Between points b and c. The number of I/O passes required by FindLarge 
decreases gradually as the coverage of the suggested set increases. This is be- 
cause as more large-sized itemsets are suggested, LGen is better able to generate 
large-sized candidate itemsets early. As an example, when mining the database
instance $\mathcal{D}$ with the support threshold set at 1% using a suggested set whose
coverage is 0.743 (point labeled B in Figure 4), LGen generated candidate itemsets
of sizes ranging from 2 to 7 early in pass number 2.

We also observe that the amount of I/O saving increases more rapidly when 
the coverage is approaching 1. This is because with a very large coverage, the 
suggested set contains many top-sized, or maximal large itemsets. This greatly 
facilitates the generation of other not-yet-discovered maximal large itemsets as 
candidates early. Since FindLarge terminates once all the maximals are found, 
only very few passes are required. 

At point c (coverage = 100%). FindLarge only needs two passes over the
database when the suggested set covers all large itemsets. The first pass counts
the support of the suggested itemsets. Since they are the only large itemsets
in the database, LGen will generate the negative border^9 [10] as candidates. In
the second pass, FindLarge counts the supports of the candidate itemsets. Since
none of the candidates is large, FindLarge terminates after only two passes.

^8 A size-k itemset suggested implicitly suggests $2^k - 1$ non-empty itemsets. So if the
suggested set contains large-sized itemsets, it would possess a good coverage.

Pass   Apriori   FindLarge      FindLarge      FindLarge
                 support=1%,    support=1%,    support=1%,
                 coverage=8%    coverage=74%   coverage=98%
1        1,000       1,234          3,461          4,118
2      220,780     220,687        220,335        220,173
3          937         858            366             56
4          777         732            159              -
5          519         504             23              -
6          229         227              3              -
7           84          84              -              -
8           19          19              -              -
9            2           2              -              -
Total  224,347     224,347        224,347        224,347

Table 1. Candidate set size for each pass when database $\mathcal{D}$ is mined using
Apriori and FindLarge.

3.3 Candidate Set Sizes 

As discussed in Section 2, FindLarge checks the same number of candidate
itemsets as Apriori does, but in fewer passes. As an example, we mined the
database $\mathcal{D}$ using Apriori and FindLarge with three suggested sets of
different coverages (8%, 74%, and 98%). Note that the numbers shown in Table 1
equal the candidate sets’ sizes except for those listed under the first pass of
FindLarge. This is because FindLarge counts the supports of all subsets of the
suggested itemsets in the first pass, besides those of the singletons. These
support counts are also included in the table. For example, during the first pass
when FindLarge mined the database $\mathcal{D}$ with a suggested set $S$ of coverage 8%,
it counted the supports of 1,000 singletons as well as 234 itemsets suggested
(explicitly or implicitly) by $S$.

From the table, we observe that the candidate set size for the second pass
dominates the others, and thus determines the memory requirement. FindLarge
redistributes the counting work and, in general, generates fewer candidate
itemsets in the second pass than Apriori does. Hence, the memory requirement of
FindLarge is comparable to that of Apriori.
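The per-pass figures in Table 1 can be checked with a few lines of arithmetic. The Python sketch below uses our own dictionary layout and labels; the numbers are transcribed from the table, with the passes marked "-" left out. It reports the number of passes, the total count, and the share of the total that falls in the second pass.

```python
# Per-pass counts transcribed from Table 1 (passes marked "-" are omitted).
table1 = {
    "Apriori":            [1000, 220780, 937, 777, 519, 229, 84, 19, 2],
    "FindLarge, cov=8%":  [1234, 220687, 858, 732, 504, 227, 84, 19, 2],
    "FindLarge, cov=74%": [3461, 220335, 366, 159, 23, 3],
    "FindLarge, cov=98%": [4118, 220173, 56],
}

summary = {
    name: {
        "passes": len(counts),
        "total": sum(counts),
        "pass2_share": counts[1] / sum(counts),
    }
    for name, counts in table1.items()
}

if __name__ == "__main__":
    for name, row in summary.items():
        print(
            f"{name:20s} passes={row['passes']} "
            f"total={row['total']:,} pass-2 share={row['pass2_share']:.3f}"
        )
```

Every column sums to the same total, which matches the statement that FindLarge counts the same number of itemsets as Apriori while spreading them over fewer passes.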

^9 The negative border is the set of itemsets which are small but all of their subsets
are large.

3.4 Sampling

We performed a set of simulation experiments to study how sampling should be
done in order to obtain a good suggested set. In the experiments, a number of
databases were generated. For each database, we extracted a fraction $f$ of
transactions as samples. We then mined the samples using Apriori. The resulting
large itemsets discovered were then used as the suggested set for FindLarge.

We observed that even with a small set of samples (e.g., $f = 1/16$), the
expected coverage of the suggested set derived exceeded 90%. Also, FindLarge
saved 31% of I/O compared with Apriori. For details, please refer to [11].



4 Conclusion 

This paper described a new algorithm FindLarge for finding large itemsets in a 
transaction database. FindLarge uses a new candidate set generation algorithm 
LGen which takes a set of multiple-sized large itemsets to generate multiple-sized 
candidate itemsets. Given a reasonably accurate suggested set, LGen allows big- 
sized candidate itemsets to be generated and processed early. This results in sig- 
nificant I/O saving compared with traditional Apriori-based mining algorithms. 

We stated a number of theorems about FindLarge and LGen. In particular,
FindLarge is correct and LGen never generates redundant candidate itemsets.
Hence the CPU requirement of FindLarge is comparable to that of Apriori. Detailed
proofs of the theorems can be found in [11].

In order to evaluate the I/O performance of FindLarge, we conducted
extensive experiments. We showed that the higher the coverage of the suggested
set, the fewer I/O passes FindLarge requires. In the best case, when the suggested
set covers all large itemsets, FindLarge takes only two passes over the database.

To obtain a good suggested set, sampling techniques can be applied. We 
showed that a small sample is usually sufficient to generate a suggested set of 
high coverage. FindLarge is thus an efficient and practical algorithm for mining 
association rules. 
Abstract. An important issue that needs to be addressed when using
association rules is the validity of the rules for new situations. Rules are
typically derived from the patterns in a particular dataset. When the
conditions under which the dataset has been obtained change, a new
situation is said to have arisen. Since the conditions existing at the time of
observation could affect the observed data, a change in those conditions
could imply a changed set of rules for a new situation. Using the set of 
rules derived from the dataset for an earlier situation could lead to wrong 
decisions. In this paper, we provide a model explaining the difference be- 
tween the sets of rules for different situations. Our model is based on 
the concept of rule-generating groups that we call caucuses. Using this 
model, we provide a simple technique, called Linear Combinations, to 
get a good estimate of the set of rules for a new situation. Our approach 
is independent of the core mining process, and so can be easily imple- 
mented with any specific technique for association rule mining. In our 
experiments using controlled datasets, we found that we could get up to 
98.3% accuracy with our techniques as opposed to 26.6% when directly 
using the results of the old situation. 

Keywords: Extending association rules, Linear Combinations,
rule-generating model.



1 Introduction 

Association rule mining is a valuable data mining technique and researchers have 
studied many aspects of the technique. However, the issue of the validity of the 
rules for new situations has not been addressed so far. By a new situation, we 
mean a change in the circumstances that existed when the data was collected. 
In general, the rules are derived from patterns in a dataset that corresponds to a 
particular situation. However, a decision has to be made, based on the rules, at a 
later instance - which could be a new situation (under changed circumstances). 
When the rules are influenced by the existing set of conditions, direct application 
of the rules from the old situation (corresponding to a different set of conditions) 
could lead to wrong decisions. 

Consider the following real-world example. A retail chain is planning to open 
a store in a new location. Decisions have to be made regarding the inventory 
and shop-space allocation for the new store - a job for which association rules
between the items for sale would be very useful. However, the only available rules
are from the sales information of already-operational stores. How dependable are
the rules from these stores, whose locations differ from that of the new store? Is
there a way to obtain a “correct” set of rules for the new store, without having
to wait for the sales transactions from the new store?

We address the above issues in this paper, through the following contribu- 
tions: 

— We introduce the concept of a set of rule-generating groups, called caucuses, 
that are common to different situations. We provide a model for association 
rules for different situations based on the concept of caucuses. This model 
provides an analytical framework for explaining the differences between the 
sets of rules for different situations. 

— We provide a simple technique, called Linear Combinations, that gives the 
estimated set of rules for a new situation. It can be used with any available 
association rule mining approach as it requires no modification to the core 
mining algorithm. 

— We demonstrate the effectiveness of our technique over just using the rules 
from the available dataset for the new situation. Our experiments using con- 
trolled datasets (generated by a common synthetic-data generation approach 
for association rule mining) show that as the differences between the indi- 
vidual caucuses increased, the mismatches in the rules for the two different 
situations also increased. When the rules from the available (old) dataset
were used directly, the accuracy of predicting rules for a new situation
varied from 82.9% to 26.6% (from similar caucuses to totally different
caucuses), whereas with our Linear Combinations approach the accuracy varied
between 96.7% and 98.3%.

The rest of the paper is organized as follows. Section 2 describes the general 
problem of association rule mining. Section 3 describes our problem in more 
detail and our Linear Combinations approach for solving it. Section 4 explains 
our experiments and results. In section 5, we discuss an alternative approach and 
explain why such an approach would not be feasible for our particular problem. 
Section 6 provides the conclusions. 



2 Background on Association Rule Mining

This section is based largely on the description of association rule mining given
by Agrawal et al. [1,2]. Let $I = \{I_1, I_2, \ldots, I_n\}$ be the domain of literals called
items. A record called a transaction contains a set of items $\{I_1, I_2, \ldots, I_k\} \subseteq I$.
The input to the association rule mining algorithm is a set of transactions, $T$. We
call any set of items $\{I_1, I_2, \ldots, I_m\} \subseteq I$ collectively an itemset. An association
rule is a relation of the form A => B in $T$, where A, B are itemsets, and
$A \cap B = \emptyset$. A is the antecedent of the rule and B is the consequent of the rule.
An itemset has a measure of statistical significance associated with it called
support. Support for an itemset X in T (support(X)) is the number of
transactions in T containing X. For a rule A => B, the associated support is
support($A \cup B$). A support fraction is the ratio of the support value to the total
number of transactions. The strength of a rule is given by another measure called
confidence. The confidence of A => B is the ratio support($A \cup B$) / support(A).
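The two measures can be illustrated with a minimal Python sketch. The function names and the toy transactions are ours and are not part of the original description; transactions are modeled as Python sets.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support_fraction(itemset, transactions):
    """Support divided by the total number of transactions."""
    return support(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A u B) / support(A) for the rule A => B."""
    a, b = frozenset(antecedent), frozenset(consequent)
    if a & b:
        raise ValueError("antecedent and consequent must be disjoint")
    support_a = support(a, transactions)
    return support(a | b, transactions) / support_a if support_a else 0.0


if __name__ == "__main__":
    # Hypothetical transactions, used only to exercise the functions.
    T = [
        {"diaper", "beer"},
        {"diaper", "beer", "milk"},
        {"diaper", "milk"},
        {"milk"},
        {"diaper", "ice cream"},
    ]
    print(support({"diaper", "beer"}, T))
    print(support_fraction({"diaper", "beer"}, T))
    print(confidence({"diaper"}, {"beer"}, T))
```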

The problem of association rule mining is to generate all rules that have
support and confidence greater than some user-specified thresholds. Itemsets that
have support greater than the user-specified support (or have a corresponding
support fraction in the subset of the data under consideration) are called large
itemsets. For a large itemset S, if $A \subset S$ and support(S)/support(A) > confidence
threshold, then A => S - A is an association rule. The problem of association
rule mining is, thus, broken down into:

(a) The task of determining all large itemsets. 

(b) The task of determining the rules with enough confidence, from the large 
itemsets. 
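Task (b) depends only on the large itemsets and their supports. The following Python sketch illustrates this; the function name and input layout are ours, and it assumes the dictionary holds the support of every large itemset, including all non-empty subsets of each one.

```python
from itertools import combinations


def generate_rules(large_supports, min_confidence):
    """Derive rules A => S - A from large itemsets and their supports.

    large_supports maps each large itemset (a frozenset) to its support.
    It is assumed to contain every non-empty subset of each large itemset.
    """
    rules = []
    for s, support_s in large_supports.items():
        for r in range(1, len(s)):  # proper, non-empty antecedents only
            for antecedent in combinations(s, r):
                a = frozenset(antecedent)
                conf = support_s / large_supports[a]
                if conf > min_confidence:
                    rules.append((a, s - a, conf))
    return rules


if __name__ == "__main__":
    # Hypothetical supports, chosen only for illustration.
    supports = {
        frozenset({"diaper"}): 4,
        frozenset({"beer"}): 2,
        frozenset({"diaper", "beer"}): 2,
    }
    for a, b, conf in generate_rules(supports, 0.4):
        print(sorted(a), "=>", sorted(b), conf)
```

No transaction data is consulted in this step, which is why a good estimate of the large itemsets and their supports is enough to estimate the rules.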

In association rule mining, the input data is used in the task of determining 
the large itemsets. The process of determining the large itemsets captures all the 
necessary information for association rules from the data. The rule-generation 
portion is then dependent only on the large itemsets. This implies that in order 
to obtain a correct estimate for a set of rules, a correct estimate for the large 
itemsets (from which the rules are generated) is both necessary and sufficient. 

3 Problem Description and Solution 

3.1 Importance of Background Attributes 

The importance of the rules identified by association rule mining (or for that 
matter any rule mining technique) is often decided by the applicability of the 
discovered rules. Consider the example depicted in Fig. 1. It shows association 
rules discovered at two stores having the same inventory, but located in different 
neighborhoods. The rule diaper => beer is valid in the first store but not in the
second, while diaper => ice cream is valid only in the second store. However, the
price of beer, ice cream and diapers and the conditions for their purchases are 
similar in both stores. Further study shows that 80% of the customers for store 1 
are men and 70% of the customers for store 2 are women, and that the men were 
found to have a strong preference for beer and the women had a preference for 
ice cream. Only then do the discrepancies in the rules make sense. 

Here, gender is a background attribute - an attribute not directly involved in 
the association rule - and is a causative factor for the two rules. The distribution 
of the different values of gender (male and female) was different among the 
customers at the two stores, giving rise to different sets of association rules 
for them. When applying association rules derived from one dataset to a new 
situation, it is necessary to ensure that the background attributes, which are 
factors giving rise to the rules, are consistent between the source of the rules 
and the situation they are being applied to. When such factors can be identified
and are found to be different, that knowledge has to be incorporated into the 
data mining procedure for providing a better set of rules for the new situation. 



Parameters common for both stores:

Total transactions = 1000, Support threshold = 50, Confidence threshold = 40%.
Price: Diapers = $10, Beer = $4, Ice cream = $3.

STORE 1:
Rule: diaper => beer (confidence = 42%).

STORE 2:
Proposed rule: diaper => beer (confidence = 28% < 40%, not valid).
Alternate rule: diaper => ice cream (confidence = 43%, valid rule).

Fig. 1. Example showing variation of rules for two stores 

3.2 Problem Definition 

We first introduce and define a basic set of terms used in the rest of the paper. 
A background attribute whose value affects the relation between the items is 
called a factor. In general, a set of factors with specific values (or sets of values) 
together affect the relation between items. When the factors are demographic 
characteristics, they would identify a particular group of people having similar 
characteristics. For instance, gender would be a factor in the example considered
above; male and female would be the associated values that identify groups
of people having similar characteristics and affect the relation between diapers,
beer and ice-cream. Members of a group behave similarly, producing the same
relation among the items they are involved with; e.g., men, in general, like
beer. This causes them to buy beer when they buy diapers, rather than ice-cream.

A set of factors together with their associated values / value-ranges that cause 
specific sets of rules is called a caucus. Each caucus is always defined by distinct 
sets of values for the factors affecting the rules. Our model for the generation of 
association rules is as follows. Every caucus has a strong affinity for producing 
a certain set of association rules (or associated items). Independent of which 
situation a caucus is found in, the set of rules (or associated items) that it has 
affinity for does not change. Each situation has a mixture of different caucuses 
in some specific proportion. The difference in the proportions of the caucuses 
for different situations gives rise to the difference between the overall set of 
association rules for each situation. This is the fundamental idea in our approach. 
So far, we have used the term situation in its conventional sense, and we
continue to do so. In addition, the term also refers to the specific combination
of caucuses associated with the situation. Every situation is also associated
with its own distinct dataset, the patterns in which are caused by the caucuses
associated with that situation.

We now define the problem of estimating association rules as follows. Given, 
1. the set of caucuses that make up two situations,

2. the proportions of the caucuses for the second situation, and 

3. a source dataset - a database of transactions involving items with corre- 
sponding values for all the factors, for the first situation; 

the task is to provide a good estimate of the association rules for the second 
situation, also referred to as the target. 

Since the data-dependent portion of the rule mining process is just the phase
when the large itemsets are determined (see Section 2), determining a good
estimate of the large itemsets (including the right values for their support) for
a new situation is equivalent to determining a good estimate of the rules for it.
So we provide a method, Linear Combinations, to obtain a good estimate of the
large itemsets for new situations.

3.3 Linear Combinations 

Our approach to estimating the itemsets for the new situation is a straight- 
forward application of the method of linear combinations. Every caucus for a 
situation defines a distinct region in the multi-dimensional factor-space. The 
values for the factors in every transaction are used to map that transaction to a 
particular caucus. For every caucus, the set of large itemsets (along with their 
fractional support in the caucus) is determined from the set of transactions
associated with it. The union of the sets of large itemsets of all caucuses is a
superset of the set of itemsets that are large in the entire dataset (because an
itemset that has adequate fractional support in a dataset must have
adequate fractional support in at least one of the caucuses). We call this union
the union-set. For any dataset whose transactions are composed of transactions
of the same set of caucuses, this union-set can be treated as the candidate set
for the dataset. Thus, only the supports of the itemsets in this union-set need
to be estimated when determining the large itemsets for the dataset of any
situation that has the same set of caucuses.
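The superset property follows from a short averaging argument, sketched here in notation that is ours rather than the paper's: $w_i$ is the fraction of the dataset's transactions that map to caucus $i$, $S(X)_i$ is the support fraction of itemset $X$ within caucus $i$, $S(X)$ is its support fraction in the whole dataset, and $\sigma$ is the support-fraction threshold. The argument assumes that every transaction maps to exactly one of the $N$ caucuses.

```latex
\[
  S(X) \;=\; \sum_{i=1}^{N} w_i \, S(X)_i
       \;\le\; \Bigl(\max_{1 \le i \le N} S(X)_i\Bigr) \sum_{i=1}^{N} w_i
       \;=\; \max_{1 \le i \le N} S(X)_i ,
  \qquad \text{since } w_i \ge 0 \text{ and } \sum_{i=1}^{N} w_i = 1 .
\]
\[
  \text{Hence } S(X) > \sigma \;\Longrightarrow\; S(X)_i > \sigma
  \text{ for at least one caucus } i .
\]
```

So any itemset that is large in the whole dataset is large in at least one caucus, and therefore appears in the union-set.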

Using the dataset for the first situation (source), the large itemset
determination phase is performed for the transactions of every caucus separately.
After this phase, the support of every potentially large itemset is available for
each caucus. The support of a large itemset in the second situation (target) is
estimated from its support in the different caucuses of the source and the
relative proportion of these caucuses in the target. Given $N$ caucuses, $L_i$ - set of
large itemsets for each caucus $i$, and $N_i$ - strength (fraction) of each caucus in
the target, with $S(L)_i$ as the notation for the support fraction of itemset $L$ in
caucus $i$, we have:

$$\text{Union-set } U = \bigcup_{i=1}^{N} L_i. \qquad (1)$$

$$\text{For each } L \in U, \text{ estimated support fraction in target} = \sum_{i=1}^{N} N_i \cdot S(L)_i. \qquad (2)$$

The large itemsets for the target are those that have the estimated support
greater than the specified support threshold. The caveat in the above approach
is the requirement of the availability of representative samples for each caucus
(that can occur in a target) in the source situation. However, this is an
understandable requirement, as one can hardly be expected to predict the behavior
of a group without any information about the group. The simplicity of our
approach is tied to its ability to be used independent of the core mining process.
Thus, the approach can be applied to the vast number of association rule mining
techniques [3,4,5,6,7,8,9,10], each of which is different in the nature of items it
handles or the approach taken to identifying the large itemsets, with little or
no modification.
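A minimal Python sketch of the estimate follows. The function names, the dictionary-based inputs, and the toy data are our assumptions: the source transactions are taken to be already partitioned by caucus, and each caucus is taken to have been mined separately. The support of each union-set member is counted directly in every caucus, which stands in for the per-caucus supports the text says are available after the mining phase.

```python
def support_fraction(itemset, transactions):
    """Fraction of the given transactions that contain the itemset."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def estimate_target_large_itemsets(caucus_transactions, caucus_large,
                                   target_weights, min_support):
    """Linear-combination estimate of the target's large itemsets.

    caucus_transactions: caucus -> list of source transactions (sets)
    caucus_large:        caucus -> set of large itemsets (frozensets), L_i
    target_weights:      caucus -> proportion N_i of the caucus in the target
    min_support:         support-fraction threshold
    """
    union_set = set().union(*caucus_large.values())
    estimates = {}
    for itemset in union_set:
        est = sum(
            target_weights[c] * support_fraction(itemset, caucus_transactions[c])
            for c in caucus_transactions
        )
        if est > min_support:
            estimates[itemset] = est
    return estimates


if __name__ == "__main__":
    fs = frozenset
    # Hypothetical source data for two caucuses.
    source = {
        "men": [{"diaper", "beer"}] * 6 + [{"diaper"}] * 4,
        "women": [{"diaper", "ice cream"}] * 5 + [{"diaper"}] * 5,
    }
    large = {
        "men": {fs({"diaper"}), fs({"beer"}), fs({"diaper", "beer"})},
        "women": {fs({"diaper"}), fs({"ice cream"}), fs({"diaper", "ice cream"})},
    }
    target = {"men": 0.3, "women": 0.7}
    for itemset, est in estimate_target_large_itemsets(source, large, target, 0.2).items():
        print(sorted(itemset), round(est, 3))
```

In this toy target the beer itemsets fall below the threshold (0.3 x 0.6 = 0.18), while the ice cream itemsets stay above it (0.7 x 0.5 = 0.35).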

4 Experimental Results 

The data-dependent (situation-dependent) phase of association rule mining is 
the one when large itemsets are determined. The quality of the estimated large 
itemsets directly determines the quality of the estimated rules for new
situations. Hence, we demonstrate the efficiency of the techniques from the
quality of large itemsets generated. Using the data generation procedure described
below we generate two datasets, one for the source and another for the target.
Both datasets are mined to obtain the correct set of large itemsets for them - $L_s$
and $L_t$ for the source and target, respectively. We then estimate the large
itemsets for the target, $L_{LC}$, using our technique and the source dataset. The
extent of matching of $L_t$ with $L_s$ determines the accuracy of using just the
source dataset, and the extent of its matching with $L_{LC}$ determines the accuracy
of our approach.
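One way to quantify the extent of matching is sketched below in Python. The function name and the choice of denominator are our assumptions: both the share of correctly predicted itemsets and the share of excess itemsets are expressed relative to the size of the true target set, which the text does not spell out.

```python
def match_report(true_large, predicted_large):
    """Compare a predicted set of large itemsets with the true one.

    Both percentages are taken relative to the size of the true set,
    which is an assumption made for this sketch.
    """
    true_set = {frozenset(x) for x in true_large}
    predicted = {frozenset(x) for x in predicted_large}
    correct = true_set & predicted   # predicted and actually large
    excess = predicted - true_set    # predicted but not actually large
    return {
        "correct_pct": 100.0 * len(correct) / len(true_set),
        "excess_pct": 100.0 * len(excess) / len(true_set),
    }


if __name__ == "__main__":
    # Hypothetical itemsets, used only to exercise the function.
    target = [{1}, {2}, {1, 2}, {3}]
    predicted = [{1}, {2}, {4}]
    print(match_report(target, predicted))
```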

4.1 Experiments 

The comparison between the correct set of large itemsets and the estimated set 
of large itemsets is based on two factors: 

— the actual items in the itemsets, and

— the support values associated with the matching itemsets. 

Further, to detect any variation in the efficiency of the technique with the extent 
of differences between the caucuses for the situations, we study three conditions: 

(a) All the caucuses involve the same set of items - similar caucuses.

(b) Half the items of each caucus are unique to it, the rest are common to all. 

(c) All the items for a caucus are unique to it - the most dissimilar caucuses. 

4.2 Data Generation 

To examine all three conditions we use synthetic datasets in our experiments. 
We use a popular method for generating transactions containing associations, 
first presented by Agrawal and Srikant [2] . We modify the method to include the 
presence of multiple caucuses in a single dataset. We generate separate datasets 
for the source and target situations. We generate a set of G caucuses that will 
be common for both situations - this involves the assignment of items to each 
caucus, and the generation of maximal and potentially large itemsets for each
caucus from its items - these two together determine the difference between the 
large itemsets/rules of the different caucuses. Next we assign an independent set 
of weights as the relative proportions of the caucuses for each situation. This 
gives rise to the overall difference between the large itemsets/rules of the two 
datasets. For all three cases the total number of items for each caucus is set 
to 100 as indicated in Fig. 2. We use a support value of 0.5 % in our experiments 
(a reasonable value for items in retail transactions). Figure 2 gives the various 
input parameters and the values used for them in our experiments. 



Symbol  Parameter                                             Value
G       Number of caucuses                                    25
-       Number of items in a caucus                           100
-       Number of common items between caucuses               0, 50, 100
I       Average size of an itemset (Poisson distribution)     4
L       Number of itemsets for a caucus                       500
T       Average size of a transaction (Poisson distribution)  10
N       Total number of transactions in a dataset             100000



Fig. 2. Parameters for data generation 



4.3 Platform 

We performed our experiments on a Dell workstation with a SOOMhz Pentium II 
processor and 128 MB of memory. Our datasets were stored as relational tables 
in a DB2 Universal Database system running on top of NT Server 4. We used the 
OR-extensions of DB2 [11] to build our data mining applications - for association
rule mining (similar to the approach in [12]), and for caucus-wise rule mining -
in an efficient manner.



4.4 Results 



[Figure: three bar-chart panels. (a) Same Items; (b) 50 same, 50 different; (c) Distinct Items.]



Fig. 3. Distribution of predicted large itemsets (correct and excess/incorrect) 



Figure 3 presents the charts showing the relative accuracy of the different 
approaches in terms of the number of large itemsets that are correctly predicted 
by them. Fig. 3(a) shows the results for the case when all caucuses involve
the same set of 100 items, Fig. 3(b) shows the case when there are 50 distinct
items per caucus with the other 50 items being common to all, and Fig. 3(c)
shows the case when all the caucuses have their own distinct set of 100 items.
The y-axis shows the number of large itemsets. Target represents the number of 
itemsets of the target for comparison with the other two bars. Source stands for 
the case when large itemsets of the source are used directly. LC stands for the 
case when estimates for the target are obtained using Linear Combination. For 
each case, the lower component of the corresponding bar indicates the number of 
correct large itemsets predicted with that technique and the higher component 
indicates the number of excess (incorrect) itemsets predicted by that technique. 

The percentage of correct large itemsets captured with the linear combina- 
tions approach varies from 96.7% to 98.3% and the excess itemsets from 4% 
to 1.8%. In contrast, directly using the large itemsets of the source results in the 
percentage of correct itemsets varying from 82.9% to 26.6%, and the percentage 
of excess itemsets varying from 16.3% to 59.7%. The error with Source increases 
as we move from similar caucuses to more distinct caucuses ((a) to (c)). 
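
The two percentages reported above can be computed from a predicted set and a target set of large itemsets. The following is only a minimal sketch: the function name `prediction_accuracy` is hypothetical, and it assumes that both percentages are taken relative to the number of large itemsets in the target, which the text does not state explicitly.

```python
def prediction_accuracy(predicted, target):
    """Return (% correct, % excess) for a predicted set of large itemsets.

    Assumption: both percentages are relative to the number of large
    itemsets in the target; the text does not state the denominator.
    """
    predicted, target = set(predicted), set(target)
    correct = predicted & target   # large itemsets predicted correctly
    excess = predicted - target    # predicted itemsets that are not large in the target
    return 100.0 * len(correct) / len(target), 100.0 * len(excess) / len(target)


# Toy example with itemsets written as frozensets of item names.
target = {frozenset("ab"), frozenset("bc"), frozenset("cd"), frozenset("a")}
predicted = {frozenset("ab"), frozenset("bc"), frozenset("a"), frozenset("de")}
pct_correct, pct_excess = prediction_accuracy(predicted, target)
```

In this toy case three of the four target itemsets are recovered and one predicted itemset is excess.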



[Figure 4: stacked bar charts in panels (a), (b) and (c); each panel has one bar for Source and one for LC. Panel (b) is titled "50 same, 50 different".] 

Fig. 4. Distribution of error in support values for correctly predicted large itemsets 



Figure 4 shows the distribution of the percentage error in support values for 
the correctly predicted itemsets. The bars marked Source and LC are associated 
with the direct usage of source’s large itemsets and the Linear Combinations 
approach, respectively. Each bar is broken down into many components, where 
each component is proportional to the number of itemsets with a particular error 
percentage (range) in the support value. The legend specifies each component 
that corresponds to a specific range in the support error percentage. It can be 
seen that for the Linear Combinations approach the percentage error in the 
support values is almost always less than 20% (and mostly less than 10%), while 
the direct use of the source's large itemsets leaves many itemsets with 
large errors in the support values. 

The results indicate that, when the proportions of the caucuses differ between 
situations, using Linear Combination improves the accuracy of the estimated 
association rules quite dramatically. Our technique is quite accurate not only 
in capturing the correct set of itemsets but also in estimating their actual 
support values. It also seems to be equally effective across the entire range of 
item-composition for the caucuses, from identical items to completely distinct 
sets of items for each caucus. In contrast, the error incurred without 
Linear Combinations increases as the item-compositions of the caucuses become 
more and more distinct from one another. 

5 Alternative Approach 

An alternative approach to that of using the notion of caucuses is one where the 
rule generating process is modified by including the causative factors in the core 
mining process. The idea behind this would be that one would obtain rules that 
are qualified by the values of the corresponding factors for those rules, e.g. {For 
Men: Diaper => beer} and {For Women: Diaper => ice-cream}. Since the rules 
are conditioned by the values for the factors, knowledge of any change in the 
distribution of these values for new situations can be used in conjunction with 
these qualifiers to identify the correct set of rules. The difficulties in realizing 
this approach are: 

— Additional attributes could make the computational cost increase exponen- 
tially. 

— Difference in nature and type of these factors from the original set of items 
could require significant changes to the rule mining algorithm to include 
them. 

— Factors can be highly correlated - could lead to a large number of rules 
involving them, many of which could be potentially useless. 

Additionally, the nature of association between the factors and items could 
be very different from those between the factors or those between the items. This 
implies the usage of different metrics/methods for the identification of patterns 
between the factors and items (causative rules for which quite different methods 
are used [13,14]), and for patterns between the items alone. Further, the qualifi- 
cations for a rule, in terms of values for the causative factors, might need to be 
simplified or eliminated (customized for the new situation) when trying to get 
the whole picture of the set of rules for a new situation. However, in the absence 
of any knowledge about the caucuses or any method to identify them, this al- 
ternate approach might be a reasonable method for obtaining a more informed 
set of rules (than the ones involving just the items). 

6 Conclusions 

Researchers in mining association rules have focused on aspects of the mining 
process related to the discovery of rules from a dataset corresponding to a 
particular situation. However, the ability to identify the association rules for new 
situations has an equally important role in knowledge discovery and decision 
support systems. In this paper, we provide a model that distinguishes between 
the rules for different situations using the concept of association-rule-generating 
groups or caucuses. Using this model, given the dataset for one situation, we 
provide a simple approach for estimating the set of rules for the second situa- 
tion. Our approach, Linear Combinations, requires no modification to the core 
mining process, and can be used to extend the rules from all kinds of associa- 
tion rule mining techniques. We show the errors in directly applying association 
rules derived from the dataset of one situation to another different situation. We 
demonstrate the effectiveness of the Linear Combinations approach in deriving 
the rules for the second situation with a high degree of accuracy. We conclude 
that our approach provides a simple, but powerful, means for correctly extending 
the application of association rules from one situation to another. 

We are currently working on getting access to real-world datasets that might 
exhibit additional complexities to fine-tune our technique. We are also looking 
at techniques for the automatic identification of the causative factors and the 
caucuses. 

Abstract. In this paper, we study the issue of maintaining association 
rules in a large database of sales transactions. The maintenance of 
association rules can be mapped into the problem of maintaining large 
itemsets in the database. Because the mining of association rules is 
time-consuming, we need an efficient approach to maintain the large 
itemsets when the database is updated. In this paper, we present 
efficient approaches to solve the problem. Our approaches store the 
itemsets that are not large at present but may become large itemsets 
after updating the database, so that the cost of processing the updated 
database can be reduced. Moreover, we discuss the cases where the 
large itemsets can be obtained without scanning the original database. 
Experimental results show that our algorithms outperform other 
algorithms, especially when the original database need not be scanned 
in our algorithms. 



1 Introduction 

Data mining has attracted much attention in database communities because of its wide 
applicability. One major application area of data mining is to discover potentially 
useful information from transaction databases. The problem of mining association 
rules from transactional data was introduced in [1]. A transaction in the database 
consists of a set of items (itemset). An example of such an association rule might be 
“80% of customers who buy itemset X also buy itemset Y”. The support count of an 
itemset is the number of transactions containing the itemset, and the support of the 
itemset is the fraction of transactions in the database that contain it. The itemset 
($X \cup Y$ in the example) involved in an association rule must be contained in a 
predetermined number of transactions. This number is called the minimum support 
count, and the corresponding fraction of transactions is called the minimum support 
threshold. An itemset is called a large itemset if its support is no less than the 
minimum support threshold. Since the generation of association rules is 
straightforward after the large itemsets are discovered, finding large itemsets 
becomes the main work of mining association rules. 
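
The definitions of support count and large itemset can be illustrated with a brute-force sketch. It is not an efficient mining algorithm: the function names, the `max_size` cut-off and the toy transactions are assumptions made only for this illustration.

```python
from itertools import combinations


def support_counts(transactions, max_size=2):
    """Count, for every itemset of up to max_size items, the transactions containing it."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts


def large_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support count} for itemsets whose support is no less than min_support."""
    min_count = min_support * len(transactions)  # minimum support count
    counts = support_counts(transactions, max_size)
    return {x: c for x, c in counts.items() if c >= min_count}


# Hypothetical toy database of four transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
large = large_itemsets(transactions, min_support=0.5)
```

With a minimum support threshold of 50%, the minimum support count is 2, so {milk, butter}, which occurs in only one transaction, is not large.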

A database is “dynamic” in the sense that users may periodically or occasionally 
insert data to or remove data from the database. The update to the database may cause 
not only the generation of new rules but also the invalidation of some existing rules. 
Thus, the maintenance of the association rules is a significant problem. One possible 
approach to the problem is to rerun the mining algorithm on the updated database. 
This approach is simple but inefficient since it does not use the existing information 
such as the support counts of the current large itemsets. In [2], the FUP algorithm is 
proposed for the maintenance of discovered association rules in large databases. If 
there are frequent itemsets in the increment database that are not large itemsets of 
the original database, then the algorithm scans the original database to check whether 
they are large itemsets or not. FUP, FUP* [3] and FUP2 [4] are all generate-and-test 
algorithms, and are referred to as k-pass algorithms because they have to scan the 
database k times. Two algorithms, DIUP and DDUP [5], extended from DLG [7], are 
proposed to handle the incremental update problem. An 
association graph and an information table are constructed to record whether a 
previously generated large itemset remains large in the updated database. With this 
information, these two algorithms can finish the update work in one database scan. 

In this paper, we present efficient approaches to maintain large itemsets. Our 
approaches store the itemsets that are not large in the original database but may 
become large itemsets after the database is updated, so that the cost of processing the 
updated database can be reduced. Moreover, we discuss the cases where the large 
itemsets can be obtained without scanning the original database. Experimental results 
show that our algorithms outperform FUP and FUP2 algorithms, especially when the 
original database need not be scanned in our algorithms. 

The remainder of the paper is organized as follows. The problem description is 
given in Section 2. In Section 3, our approaches are proposed for the maintenance of 
large itemsets. Performance results are presented in Section 4. Section 5 contains the 
conclusion. 



2 Problem Description 

Let L be the set of all the large itemsets in the original database DB, s be the 
minimum support threshold, and d the number of transactions in DB. Assume that the 
support count of each large itemset in the original database has been recorded. After 
some updates, the increment database $DB^+$ is added into the original database DB 
and the decrement database $DB^-$ is removed from the original database, resulting in 
the updated database $DB'$. The numbers of transactions in $DB^+$, $DB^-$, and $DB'$ are 
denoted $d^+$, $d^-$, and $d'$, respectively, and the support counts of itemset X in DB, 
$DB^+$, $DB^-$, and $DB'$ are $X_{count}$, $X^+_{count}$, $X^-_{count}$, and $X'_{count}$, respectively. With 
the same minimum support threshold s, a large itemset X of DB remains a 
large itemset of $DB'$ if and only if its support in $DB'$ is no less than s, i.e. 

$$X'_{count} \ge (d + d^+ - d^-) * s = d' * s.$$

All the itemsets in DB can be divided into the following four groups: 



Group 1 : the itemsets which are large in both DB and DB' . 

Group 2: the itemsets which are large in DB but not large in DB' . 

Group 3 : the itemsets which are not large in DB but large in DB' . 

Group 4: the itemsets which are neither large in DB nor DB' . 

The itemsets of group 4 cannot be used to generate any rule because their supports 
are less than the minimum support threshold. Thus, these itemsets are of no interest. The 
itemsets in group 1 and group 2 can be obtained easily. For example, let X be a large 
itemset of DB. The support count of X in $DB'$ is 

$$X'_{count} = X_{count} + X^+_{count} - X^-_{count},$$

where $X_{count}$ is known and $X^+_{count}$ and $X^-_{count}$ are available after scanning $DB^+$ 
and $DB^-$, respectively. However, the itemsets of group 3 need further consideration. 
The support counts of these itemsets in $DB'$ cannot be determined after 
scanning $DB^+$ and $DB^-$ alone, since their support counts in DB are unknown in advance. 
Thus, how to efficiently generate the itemsets of group 3 is an important problem for 
the maintenance of association rules. In the following section, we propose efficient 
approaches to solve the problem. 
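
The handling of groups 1 and 2 can be sketched as follows. This is only an illustration of the count update described above: the function name and the dictionary representation of the recorded counts are assumptions, not part of the algorithms proposed in this paper.

```python
def split_known_large(large_counts, inc_counts, dec_counts, d, d_plus, d_minus, s):
    """Split the large itemsets of DB into group 1 (still large) and group 2 (no longer large).

    large_counts: recorded support counts in DB of its large itemsets.
    inc_counts / dec_counts: support counts obtained by scanning DB+ / DB-.
    """
    d_new = d + d_plus - d_minus  # number of transactions in the updated database
    group1, group2 = {}, {}
    for x, count in large_counts.items():
        new_count = count + inc_counts.get(x, 0) - dec_counts.get(x, 0)
        if new_count >= d_new * s:
            group1[x] = new_count
        else:
            group2[x] = new_count
    return group1, group2


# Hypothetical numbers: 100 transactions, 10 inserted, 10 deleted, s = 10%.
large_counts = {frozenset("A"): 12, frozenset("B"): 10}
inc_counts = {frozenset("A"): 1}
dec_counts = {frozenset("B"): 3}
group1, group2 = split_known_large(large_counts, inc_counts, dec_counts,
                                   d=100, d_plus=10, d_minus=10, s=0.1)
```

Group 3 cannot be handled this way, because the support counts in DB of itemsets that were not large are not recorded.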



3 Our Approaches 

Basically, the framework of our algorithm is similar to that of the FUP algorithm [3]. 
We store the large itemsets of DB and their support counts in the set L. Let t be the 
tolerance degree, 0 < t < s. The degree controls the number of itemsets 
whose supports are less than the minimum support threshold s but no less than (s-t). 
We call these itemsets the potential large itemsets. The potential large itemsets and 
their support counts are stored in the set PL. The more potential large itemsets are 
stored, the lower the cost of processing the original database will be. In addition, we 
record in the set M the itemsets of the increment database that are not in L 
or PL but may become large itemsets. 



3.1 Insertion 

Let DB and $DB^+$ be the original database and the increment database, respectively. 
Assume that the sizes of DB and $DB^+$ are $d$ and $d^+$, respectively, and sets L and PL 
are available before we update DB. When the increment database $DB^+$ is scanned, 
the support counts of the itemsets in $L \cup PL$ are accumulated. Then L and PL are 
updated according to the support count thresholds $s * (d + d^+)$ and $(s - t) * (d + d^+)$, 
respectively. Let X be an itemset of $DB^+$ and $X \notin L \cup PL$. By Theorem 1, if the 
support count of X in $DB^+$ is greater than $d^+ * (s - t)$, then X may be a large or 
potential large itemset of the updated database $DB'$. In this case, the itemset X is 
stored in set M. Let MS be the maximal value of the support counts of itemsets in M. 
By Theorem 2, if $MS < d^+ * s + d * t$, we need not scan DB. Namely, in this case 
no itemset in M can be a large itemset of the updated database. The Insert algorithm 
is shown in [6]. 

Theorem 1: Assume that the support counts of itemset X in the original database DB, 
the increment database $DB^+$, and the updated database $DB'$ are $X_{count}$, $X^+_{count}$, 
and $X'_{count}$, respectively. If $X \notin L \cup PL$ and $X^+_{count} < d^+ * (s - t)$, then X is not a large 
or potential large itemset of $DB'$. 

[PROOF] 

$$\begin{aligned}
X'_{count} &= X_{count} + X^+_{count} \\
 &< d * (s - t) + X^+_{count} \\
 &< d * (s - t) + d^+ * (s - t) \\
 &= (d + d^+) * (s - t)
\end{aligned}$$

Therefore, X is not a large or potential large itemset of $DB'$. 
Theorem 2: If $MS < d^+ * s + d * t$, then we need not scan the database DB. 

[PROOF] 

Assume that X is the itemset in M which has the maximal support count. 

Since $X \notin L \cup PL$, $X_{count} < d * (s - t)$. 

Therefore, if $MS + d * (s - t) < (d + d^+) * s$ 
(that is, $MS < d^+ * s + d * t$), then X is not a large itemset in the 
updated database $DB'$. In other words, no itemset in M can 
be a large itemset of $DB'$. 
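
Theorems 1 and 2 together suggest the following sketch of the insertion step. It is an illustration under stated assumptions, not the Insert algorithm of [6]: the function name and the dictionary representation are hypothetical, itemsets are represented by arbitrary hashable keys, and the filter for M uses "no less than" (the direct contrapositive of Theorem 1).

```python
def insert_update(L, PL, inc_counts, d, d_plus, s, t):
    """Sketch of the insertion step.

    L, PL: dicts mapping itemsets to support counts in DB (large / potential large).
    inc_counts: dict mapping itemsets to support counts in the increment database DB+.
    Returns (new_L, new_PL, M, need_scan).
    """
    new_d = d + d_plus
    new_L, new_PL = {}, {}
    # Accumulate the counts of tracked itemsets and re-classify them.
    for x, count in {**PL, **L}.items():
        total = count + inc_counts.get(x, 0)
        if total >= s * new_d:
            new_L[x] = total
        elif total >= (s - t) * new_d:
            new_PL[x] = total
    # Theorem 1: untracked itemsets with count below d+ * (s - t) in DB+ are discarded.
    M = {x: c for x, c in inc_counts.items()
         if x not in L and x not in PL and c >= d_plus * (s - t)}
    # Theorem 2: if MS < d+ * s + d * t, no itemset in M can be large, so DB is not scanned.
    MS = max(M.values(), default=0)
    need_scan = MS >= d_plus * s + d * t
    return new_L, new_PL, M, need_scan


# Hypothetical numbers: d = 1000, d+ = 100, s = 10%, t = 2%.
L = {"A": 120}
PL = {"B": 85}
inc_counts = {"A": 5, "B": 30, "C": 20, "D": 3}
new_L, new_PL, M, need_scan = insert_update(L, PL, inc_counts, d=1000, d_plus=100, s=0.10, t=0.02)
```

Here B moves from PL to L, C is kept as a candidate in M, D is discarded, and since MS = 20 is below the bound of about 30 the original database need not be scanned.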



3.2 Deletion 

The user may remove out-of-date data from the database. In the following, we discuss 
the maintenance of association rules when deletions occur. Let DB and $DB^-$ be the 
original database and the decrement database, respectively. Assume that the sizes of 
DB and $DB^-$ are $d$ and $d^-$, respectively, and sets L and PL are available before we 
update database DB. When the decrement database $DB^-$ is scanned, the support 
count of an itemset in $L \cup PL$ is decreased if this itemset appears in $DB^-$. Then L and PL 
are updated according to the support count thresholds $s * (d - d^-)$ and $(s - t) * (d - d^-)$, 
respectively. Let X be an itemset of DB and $X \notin L \cup PL$. By Theorem 3, 
if $d^- < d * t / s$, then we need not process the updated database ($DB - DB^-$). 
Otherwise, we use the DLG algorithm [7] to generate the large itemsets in the updated 
database. The Delete algorithm is shown in [6]. 

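Theorem 3: If $d^- < d * \frac{t}{s}$, i.e., $t > s * \frac{d^-}{d}$, then we need not scan the updated 
database $DB'$. 

[PROOF] 

Assume that X is an itemset of DB. 

<1> $X \in L \cup PL$ 

Since $X'_{count} = X_{count} - X^-_{count}$, and $X_{count}$ and $X^-_{count}$ are available, 
we need not scan $DB'$. 

<2> $X \notin L \cup PL$ 

Since 

$$\begin{aligned}
X'_{count} &= X_{count} - X^-_{count} \\
 &< d * (s - t) - X^-_{count} \\
 &\le d * s - d * t \\
 &< d * s - d * s * \frac{d^-}{d} \\
 &= d * s - d^- * s = d' * s,
\end{aligned}$$

X is not a large itemset of $DB'$, and we need not scan $DB'$. 

By <1> and <2>, the theorem is proved. 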
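
The deletion step and the test of Theorem 3 can be sketched in the same style. This is an illustration only, not the Delete algorithm of [6]; the function name, the dictionary representation and the numbers are assumptions.

```python
def delete_update(L, PL, dec_counts, d, d_minus, s, t):
    """Sketch of the deletion step.

    L, PL: dicts mapping itemsets to support counts in DB (large / potential large).
    dec_counts: dict mapping itemsets to support counts in the decrement database DB-.
    Returns (new_L, new_PL, need_scan).
    """
    new_d = d - d_minus
    new_L, new_PL = {}, {}
    # Decrease the counts of tracked itemsets and re-classify them.
    for x, count in {**PL, **L}.items():
        remaining = count - dec_counts.get(x, 0)
        if remaining >= s * new_d:
            new_L[x] = remaining
        elif remaining >= (s - t) * new_d:
            new_PL[x] = remaining
    # Theorem 3: if d- < d * t / s, itemsets outside L and PL cannot become large.
    need_scan = not (d_minus < d * t / s)
    return new_L, new_PL, need_scan


# Hypothetical numbers: d = 1000, s = 10%, t = 2%, so the bound d * t / s is about 200.
L = {"A": 120}
PL = {"B": 85}
dec_counts = {"A": 40, "B": 2}
new_L, new_PL, need_scan = delete_update(L, PL, dec_counts, d=1000, d_minus=100, s=0.10, t=0.02)
_, _, need_scan_big = delete_update(L, PL, dec_counts, d=1000, d_minus=300, s=0.10, t=0.02)
```

Deleting 100 transactions stays below the bound, so no scan is needed; deleting 300 exceeds it, and the updated database would have to be processed.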



3.3 Update with Insertion and Deletion 

In the following, we consider insertion and deletion at the same time. Let DB, 
$DB^+$, and $DB^-$ be the original database, the increment database, and the decrement 
database, respectively. Assume that the sizes of DB, $DB^+$, and $DB^-$ are $d$, $d^+$, 
and $d^-$, respectively. When $DB^+$ and $DB^-$ are scanned, the support count of each 
itemset in $L \cup PL$ is updated. Consider the following two cases. 

Case 1: $d^+ \ge d^-$ 

(1) Let X be an itemset of $DB^+$ and $X \notin L \cup PL$. If the support count of X in $DB^+$ is 
no less than $(d^+ - d^-) * (s - t)$, it may be a large or potential large itemset of the 
updated database. In this case, itemset X is stored in set M. It is proved as follows: 

[PROOF] 

Assume $X^+_{count} < (d^+ - d^-) * (s - t)$. 

$$\begin{aligned}
X'_{count} &= X_{count} + X^+_{count} - X^-_{count} \\
 &< d * (s - t) + (d^+ - d^-) * (s - t) \\
 &\le (d + d^+ - d^-) * (s - t)
\end{aligned}$$

Thus X is not a large or potential large itemset of the updated database. 

(2) Let X be an itemset of DB but not an itemset of $DB^+$. Assume $X \notin L \cup PL$. 
Then X cannot be a large itemset of the updated database. It is proved as follows: 

[PROOF] 

$$\begin{aligned}
X'_{count} &= X_{count} + X^+_{count} - X^-_{count} \\
 &< d * (s - t) \\
 &\le (d + d^+ - d^-) * (s - t)
\end{aligned}$$

Thus X is not a large or potential large itemset in the updated database. 

Case 2: $d^+ < d^-$ 

Let X be an itemset of DB but not an itemset of $DB^+$ or $DB^-$. Assume 
$X \notin L \cup PL$. Then X may be a large itemset in the updated database. It is proved as 
follows: 

[PROOF] 

$$X'_{count} = X_{count} + X^+_{count} - X^-_{count} < d * (s - t)$$

Since $d * (s - t) > (d + d^+ - d^-) * (s - t)$, X may be a large itemset of the 
updated database. In this case, we use the DLG algorithm [7] to process the 
updated database and generate the sets L and PL. 

Let MS be the maximal value of the support counts of itemsets in M. By Theorem 4, if 
$MS < d * t + d^+ * s - d^- * s$, we need not process the updated database. The Update 
algorithm is shown in [6]. 

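Theorem 4: If $MS < d * t + d^+ * s - d^- * s$, we need not scan the updated 
database $DB'$. 

[PROOF] 

Assume that X is an itemset of $DB \cup DB^+$. 

<1> $X \in L \cup PL$ 

Since $X'_{count} = X_{count} + X^+_{count} - X^-_{count}$, and $X_{count}$, $X^+_{count}$, and $X^-_{count}$ 
are available, we need not scan $DB'$. 

<2> $X \notin L \cup PL$ 

Since 

$$\begin{aligned}
X'_{count} &= X_{count} + X^+_{count} - X^-_{count} \\
 &< d * (s - t) + X^+_{count} - X^-_{count} \\
 &\le d * s - d * t + MS - X^-_{count} \\
 &\le d * s - d * t + MS \\
 &< d * s - d * t + (d * t + d^+ * s - d^- * s) \\
 &= d * s + d^+ * s - d^- * s = d' * s,
\end{aligned}$$

X is not a large itemset of $DB'$, and we need not scan $DB'$. 

By <1> and <2>, the theorem is proved. 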
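
The candidate filter of Case 1 and the test of Theorem 4 can be sketched as two small functions. This is an illustration only, not the Update algorithm of [6]; the function names and the numbers are hypothetical, and Case 2 (where the updated database is processed with DLG) is deliberately left out.

```python
def candidates_for_M(inc_counts, tracked, d_plus, d_minus, s, t):
    """Case 1 (d+ >= d-): keep untracked itemsets of DB+ whose count is
    no less than (d+ - d-) * (s - t)."""
    threshold = (d_plus - d_minus) * (s - t)
    return {x: c for x, c in inc_counts.items()
            if x not in tracked and c >= threshold}


def scan_can_be_skipped(M, d, d_plus, d_minus, s, t):
    """Theorem 4: no processing of the updated database is needed
    when MS < d * t + d+ * s - d- * s."""
    MS = max(M.values(), default=0)
    return MS < d * t + d_plus * s - d_minus * s


# Hypothetical numbers: d = 1000, d+ = 200, d- = 100, s = 10%, t = 2%.
tracked = {"A", "B"}                      # itemsets already in L or PL
inc_counts = {"A": 30, "C": 25, "D": 5}   # support counts in DB+
M = candidates_for_M(inc_counts, tracked, d_plus=200, d_minus=100, s=0.10, t=0.02)
skip = scan_can_be_skipped(M, d=1000, d_plus=200, d_minus=100, s=0.10, t=0.02)
skip_large_MS = scan_can_be_skipped({"C": 40}, d=1000, d_plus=200, d_minus=100, s=0.10, t=0.02)
```

With these numbers the bound is about 30, so MS = 25 allows the scan to be skipped while MS = 40 does not.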



4 Experimental Results 

To assess the performance of our algorithms for the maintenance of large itemsets, we 
perform several experiments on a Sun SPARC/10 workstation. 



4.1 Generation of Synthetic Data 

In the experiments, we used synthetic databases to evaluate the performance of the 
algorithms. The data is generated using the same approach introduced in [2]. The 
parameters used to generate the database are described in Table 1. 

Table 1. The parameters 

Parameter       Meaning
d               Number of transactions in the original database DB
$d^+$, $d^-$    Number of transactions in the increment/decrement databases ($DB^+$ / $DB^-$)
|I|             Average size of the potentially large itemsets
|MI|            Maximum size of potentially large itemsets
|L|             Number of the potentially large itemsets
|T|             Average size of the transactions
|MT|            Maximum size of the transactions



We generate the dataset by setting N=1000, d=100,000, |L|=2000, |I|=3, |MI|=5, 
|T|=5, and |MT|=10. To insert the increment database $DB^+$, we append its $d^+$ 
transactions to the rear of the original database. To delete the decrement 
database $DB^-$, we remove the first $d^-$ transactions from the original database. 



4.2 Comparison of Insert Algorithm with FUP 

We perform an experiment on the dataset with minimum support 1%. The experiment 
runs eight insertions of increment databases, and the relative execution time of each 
insertion is recorded for the Insert algorithm and the FUP algorithm. A new database is 
generated for each insertion. Fig. 1 shows the relative execution times in the cases of 
t=0.1% and t=0.5%. The size of each increment database is 1% of the size of the 
original database. In this experiment, the Insert algorithm takes much less time than the 
FUP algorithm since the Insert algorithm does not scan the original database. 

Fig. 1. Relative execution time for Insert and FUP ($d^+ : d = 1:100$) 

4.3 Comparison of Delete Algorithm with FUP2 

In the following, we compare the Delete algorithm with the FUP2 algorithm. The minimum 
support considered is also 1%. In Fig. 2, the size of each decrement database is 1% of 
the size of the original database, and t=0.1% and t=0.5% are considered. 
It can be seen that the Delete algorithm takes much less time than FUP2. 

Fig. 2. Relative execution time for Delete and FUP2 ($d^- : d = 1:100$) 



4.4 Comparison of Update Algorithm with FUP2 

We assess the performance of the Update algorithm considering insertion and deletion at 
the same time. In Fig. 3, the sizes of the increment database and the decrement 
database are 0.1% of the size of the original database. It can be seen that the Update 
algorithm takes much less time than the FUP2 algorithm since the size of the original 
database is much larger than that of the increment database and the decrement 
database. 

Fig. 3. Relative execution time for Update and FUP2 ($d^+ : d^- : d = 1:1:1000$) 

5 Conclusion 

Users may frequently or occasionally update the database, and such updates 
change the characteristics of the database. To efficiently discover and maintain 
large itemsets in a large database, we present three algorithms: 

1. Insert algorithm for the insertion of the increment database to the original database. 

2. Delete algorithm for the deletion of the decrement database from the original 
database. 

3. Update algorithm considering both the increment database and the decrement 
database at the same time. 

In our algorithms, we store more information than other algorithms so that the cost 
of processing the original database can be reduced. Moreover, we discuss the cases 
where the large itemsets can be obtained without scanning the original database. 
Experimental results show that our algorithms outperform other algorithms, especially 
when the size of the original database is much larger than that of the increment 
database and the decrement database. 

Abstract. Discovering association rules among items in large databases is 
recognized as an important database mining problem. The problem was 
originally introduced for sales transaction databases and did not address missing 
data. However, missing data often occur in relational databases, especially in 
business ones. It is not obvious how to compute association rules from such 
incomplete databases. In the paper, we show and prove how to estimate the 
support and confidence of an association rule induced from an incomplete 
relational database. We also introduce definitions of expected support and 
confidence of an association rule. The proposed definitions guarantee some 
required properties of itemsets and association rules. Finally, we discuss 
another approach to missing values, based on so-called valid databases, and 
compare both approaches. 



1 Introduction 

Discovering association rules among items in large databases is recognized as an 
important database mining problem. The problem was introduced in [1] for sales 
transaction database. The association rules identify sets of items that are purchased 
together with other sets of items. For example, an association rule may state that 80% 
of customers who buy fish also buy white wine. Several extensions and variations of 
the notion of an association rule were offered in the literature (see e.g. [5, 6, 14, 15]). 
One such extension is a generalized rule that can be discovered from a taxonomic 
database [14]. Languages have been proposed (see e.g. [9] for an SQL extension) for 
mining association rules that satisfy user-imposed constraints. Applications of 
association rules range from decision support to telecommunications alarm diagnosis 
and prediction. 

Originally, missing values were not considered for a transaction database. 
However, when one tries to discover associations between values of different 
attributes, one may often face the problem of missing values. The problem is 
especially acute for business databases. Missing data may result from errors, 
measurement failures, changes in the database schema, etc. 

Several solutions to the problem of generating decision trees or decision rules from 
a training set of examples with unknown values have been proposed in the area of 
artificial intelligence. The simplest among them consist in removing examples with 
unknown values or replacing unknown values with the most common values. More 
complex approaches were presented in [4, 11]. A Bayesian formalism is used in [4] to 
determine the probability distribution of the unknown value over the possible values 
from the domain. This method could either choose the most likely value or divide the 
example into fractional examples, each with one possible value weighted according to 
the probabilities determined. It is suggested in [11] to predict the value of an attribute 
based on the values of other attributes of the example and the class information. These 
and other approaches to missing values were also proposed in the area of rough sets 
(e.g. [7, 8]; see the overview in [3]). 

Yet another treatment of missing values was proposed in [12]. The main idea was 
to cut a database into several valid databases without missing values. The notions of 
support and confidence of an association rule were redefined. It was shown how to 
modify efficient data mining algorithms [2] so as to compute association rules from 
incomplete databases. 

In the paper we investigate properties of databases with missing attribute values. 
We provide and prove pessimistic and optimistic estimations of support and 
confidence of an association rule. Whatever the real support and confidence of a 
rule are, they do not exceed the estimated thresholds. This part of the paper is an extension 
of the work [8], where the generation of classification rules from incomplete decision 
tables was considered in the context of rough sets [10]. 

In the paper we also offer definitions of expected values of support and confidence 
based on known attribute values in the database. The proposed definitions guarantee 
some required properties of itemsets and association rules. Finally, we present in 
more detail the approach to missing values offered in [12] and point out its 
shortcomings. 

2 Association Rules 

2.1 Association Rules in Transaction Databases 

The definition of a class of regularities called association rules and the problem of 
discovering them were introduced in [1]. Here, we describe this problem after [1, 2]. 
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of distinct literals, called items. In general, any set of 
items is called an itemset. Let D be a set of transactions, where each transaction T is a 
set of items such that $T \subseteq I$. An association rule is an expression of the form $X \Rightarrow Y$, 
where $\emptyset \ne X, Y \subset I$ and $X \cap Y = \emptyset$. X is called the antecedent and Y is called the 
consequent of the rule. 

The statistical significance of an itemset X is called support and is denoted by $sup(X)$. 
$sup(X)$ is defined as the percentage (or the number) of transactions in D that contain 
X. The statistical significance (support) of a rule $X \Rightarrow Y$ is denoted by $sup(X \Rightarrow Y)$ and is 
defined as follows: $sup(X \Rightarrow Y) = sup(X \cup Y)$. 

Additionally, an association rule is characterized by confidence, which expresses 
its strength. The confidence of an association rule $X \Rightarrow Y$ is denoted by $conf(X \Rightarrow Y)$ 
and is defined as follows: $conf(X \Rightarrow Y) = sup(X \Rightarrow Y) / sup(X)$. 

The problem of mining association rules is to generate all rules that have support 
greater than some user specified minimum support and confidence not less than a user 
specified minimum confidence. Several efficient solutions applicable for large 
databases were proposed to solve this problem (see [2, 13]). The problem of 
generating association rules is usually decomposed into two subproblems: 







Marzena Kryszkiewicz 



1. Generate all itemsets whose support exceeds the minimum support minSup. The 
itemsets with this property are called frequent (large). 

2. Generate all association rules whose confidence is not less than the minimum 
confidence minConf. In the process of rule generation, only frequent itemsets 
are considered. Let X be a frequent itemset and $\emptyset \ne Y \subset X$. Then any candidate 
rule $X \setminus Y \Rightarrow Y$ is an association rule if $sup(X) / sup(X \setminus Y) \ge minConf$. 

In both subproblems the following properties of itemsets are exploited extensively: 

• If $X \subset Y$ then $sup(X) \ge sup(Y)$. 

• All subsets of a frequent itemset are frequent and all supersets of an infrequent 
itemset are infrequent. 

The properties above considerably reduce the number of unnecessary (redundant) 
evaluations of candidate itemsets and candidate rules. 
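
The second property is what allows candidates to be discarded before their support is counted. A minimal sketch of such a pruning step follows; the function names and the toy itemsets are assumptions made for illustration and are not taken from [2] or [13].

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_smaller):
    """True if some subset with one item fewer is not among the frequent itemsets."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_smaller
               for sub in combinations(candidate, k - 1))


def prune_candidates(candidates, frequent_smaller):
    """Drop candidates that cannot be frequent because one of their subsets is infrequent."""
    return [c for c in candidates if not has_infrequent_subset(c, frequent_smaller)]


# Hypothetical frequent 2-itemsets and candidate 3-itemsets.
frequent_2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
candidates_3 = [frozenset("abc"), frozenset("bcd")]
kept = prune_candidates(candidates_3, frequent_2)
```

The candidate {b, c, d} is pruned without any database access because its subset {c, d} is not frequent.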



2.2 Association Rules in Relational Databases 

A relational table is a pair D = (O, AT), where O is a non-empty finite set of tuples 
and AT is a non-empty finite set of attributes, such that $a: O \to V_a$ for any $a \in AT$, 
where $V_a$ is called the domain of a. The notion of a tuple corresponds to the notion of 
a transaction, but the size of a tuple is fixed. A relational item is meant to be an 
attribute-value pair (a,v), where $a \in AT$ and $v \in V_a$. In the paper, we assume that a 
relational itemset is any set of relational items in which no attribute occurs more than 
once¹. Relational association rules are constructed in the usual way, but only from 
relational itemsets. The definitions of support and confidence are the same as for a 
transaction database. 

Example 1. Let Fig. 1 present a database D under consideration. 



Id  X1  X2  X3  X4
1   a   a   a   c
2   a   a   b   d
3   a   b   c   c
4   a   b   d   c
5   b   b   e   d
6   b   b   f   c
7   b   c   g   c
8   b   c   h   d

Fig. 1. Exemplary database 



Fig. 2 contains exemplary association rules that hold in D: 



X => Y                          sup(X => Y)   conf(X => Y)
{(X1,a)} => {(X4,c)}            3/8           3/4
{(X4,c)} => {(X1,a)}            3/8           3/5
{(X1,a),(X2,b)} => {(X4,c)}     2/8           2/2
{(X2,b),(X4,c)} => {(X1,a)}     2/8           2/3

Fig. 2. Exemplary association rules with their support and confidence in D 



□ 
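
The values in Example 1 can be recomputed with a short sketch. The data below mirrors the exemplary database of Fig. 1; the function names and the representation of relational items as (attribute, value) pairs are assumptions made for illustration.

```python
from fractions import Fraction

ATTRS = ("X1", "X2", "X3", "X4")

# Exemplary database of Fig. 1: tuple id -> values of X1..X4.
D = {
    1: ("a", "a", "a", "c"),
    2: ("a", "a", "b", "d"),
    3: ("a", "b", "c", "c"),
    4: ("a", "b", "d", "c"),
    5: ("b", "b", "e", "d"),
    6: ("b", "b", "f", "c"),
    7: ("b", "c", "g", "c"),
    8: ("b", "c", "h", "d"),
}


def matches(row, itemset):
    """True if the tuple has value v for attribute a, for every item (a, v)."""
    return all(row[ATTRS.index(a)] == v for a, v in itemset)


def sup(itemset):
    """Support of a relational itemset as a fraction of the tuples in D."""
    return Fraction(sum(matches(row, itemset) for row in D.values()), len(D))


def conf(antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    return sup(antecedent | consequent) / sup(antecedent)


x1a = frozenset({("X1", "a")})
x4c = frozenset({("X4", "c")})
x1a_x2b = frozenset({("X1", "a"), ("X2", "b")})
x2b_x4c = frozenset({("X2", "b"), ("X4", "c")})
```

For instance, `sup(x1a | x4c)` gives 3/8 and `conf(x1a, x4c)` gives 3/4, in agreement with the first row of Fig. 2.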



¹ In general, it may be useful to allow several items with the same attribute in an itemset. This 
is the case when association rules are to be discovered from groups of tuples (see e.g. [9]). 

3 Incomplete Databases 

3.1 Basic Notions 

Computation of support and confidence of an association rule is not obvious in the 
case of databases with missing attribute values. Below, we introduce some notions 
that will be useful in investigating properties of such rules. Missing values will be
denoted by *.

We call an itemset X regular iff v ≠ * for every (a,v) ∈ X. Further on, we consider
only regular itemsets. Sometimes X will play the role of a pattern.

The maximal set of tuples that necessarily match the pattern X is denoted by n(X)
and is defined as follows: n(X) = {x ∈ D | ∀(a,v) ∈ X: a(x) = v}.

By m(X) we denote the maximal set of tuples that may match the pattern X in D,
i.e. m(X) = {x ∈ D | ∀(a,v) ∈ X: a(x) ∈ {v,*}}.

The difference m(X) \ n(X) is denoted by d(X); that is, d(X) is the set of tuples in D
that possibly match the pattern X, although this is not certain.

By n(-X) we denote the maximal set of tuples that certainly do not match the
pattern X in D, i.e. n(-X) = D \ m(X).

By m(-X) we denote the maximal set of tuples that may not match the pattern X in
D, i.e. m(-X) = D \ n(X).

The difference m(-X) \ n(-X) is denoted by d(-X).

We may interpret -X in the following way: 

Let F(X) = {{(a_1,v_1), ..., (a_k,v_k)} | a_i occurs in X, v_i ∈ V_{a_i}, a_i ≠ a_j for i ≠ j; i, j = 1, ..., k}. Then, -X
may be understood as a generalized pattern that is an alternative of all patterns in F(X)
except X. Further on, -X will be called the anti-pattern of X.

Example 2. Given the incomplete database D presented in Fig. 3, we will illustrate 
the introduced notions of necessary and possible patterns and anti-patterns matching. 



Id   X1   X2   X3   X4
1    *    a    a    c
2    a    a    b    *
3    a    b    c    c
4    a    b    d    c
5    *    b    e    d
6    b    b    f    *
7    b    c    g    *
8    b    c    h    *



Fig. 3. Exemplary database with missing values 



Fig. 4 contains the respective sets of tuples that match exemplary itemsets-patterns. 



Itemset X                  n(X)      m(X)               d(X)        n(-X)            m(-X)            d(-X)
{(X1,a)}                   {2,3,4}   {1,2,3,4,5}        {1,5}       {6,7,8}          {1,5,6,7,8}      {1,5}
{(X4,c)}                   {1,3,4}   {1,2,3,4,6,7,8}    {2,6,7,8}   {5}              {2,5,6,7,8}      {2,6,7,8}
{(X1,a),(X4,c)}            {3,4}     {1,2,3,4}          {1,2}       {5,6,7,8}        {1,2,5,6,7,8}    {1,2}
{(X1,a),(X2,b)}            {3,4}     {3,4,5}            {5}         {1,2,6,7,8}      {1,2,5,6,7,8}    {5}
{(X2,b),(X4,c)}            {3,4}     {3,4,6}            {6}         {1,2,5,7,8}      {1,2,5,6,7,8}    {6}
{(X1,a),(X2,b),(X4,c)}     {3,4}     {3,4}              ∅           {1,2,5,6,7,8}    {1,2,5,6,7,8}    ∅


Fig. 4. Tuples matching (anti)pattern X



Let us note that, for example, tuple 5 in Fig. 3 certainly does not match the pattern
{(X1,a),(X4,c)}, although the value of attribute X1 is missing. The value of
attribute X4 for tuple 5 is known and equal to d, which is different from the value
c related to attribute X4 in the pattern {(X1,a),(X4,c)}. So, whatever the real value
of tuple 5 for attribute X1 is, the tuple will not match the pattern {(X1,a),(X4,c)}.

□ 
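The sets of Example 2 follow mechanically from the definitions in Section 3.1. The sketch below recomputes them for the database of Fig. 3; the function names (`n`, `m`, `d`, `n_anti`, `m_anti`) and the encoding of rows as strings are illustrative assumptions, not notation taken from the paper.

```python
MISSING = "*"
ATTRS = ("X1", "X2", "X3", "X4")
# Fig. 3: tuple id -> values of X1..X4, "*" marks a missing value
ROWS = {1: "*aac", 2: "aab*", 3: "abcc", 4: "abdc",
        5: "*bed", 6: "bbf*", 7: "bcg*", 8: "bch*"}
D = {tid: dict(zip(ATTRS, values)) for tid, values in ROWS.items()}

def n(X):
    """Tuples that certainly match pattern X."""
    return {t for t, row in D.items() if all(row[a] == v for a, v in X)}

def m(X):
    """Tuples that possibly match X (a missing value may be anything)."""
    return {t for t, row in D.items()
            if all(row[a] in (v, MISSING) for a, v in X)}

def d(X):
    """Tuples that possibly, but not certainly, match X."""
    return m(X) - n(X)

def n_anti(X):
    """n(-X): tuples that certainly do not match X."""
    return set(D) - m(X)

def m_anti(X):
    """m(-X): tuples that possibly do not match X."""
    return set(D) - n(X)

X = {("X1", "a"), ("X4", "c")}
print(sorted(n(X)), sorted(m(X)), sorted(d(X)))
# prints: [3, 4] [1, 2, 3, 4] [1, 2]
```

Tuple 5 ends up in `n_anti(X)` because its known value X4 = d already rules out the pattern, exactly as argued in the example.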

Below we provide several properties of the introduced notions of pattern matching.

Property 1. Let X be an itemset in D.

a) n(-X) = {x ∈ D | ∃(a,v) ∈ X: a(x) ∉ {v,*}}

b) m(-X) = {x ∈ D | ∃(a,v) ∈ X: a(x) ≠ v}

c) d(X) = {x ∈ D | ∀(a,v) ∈ X: a(x) ∈ {v,*} and ∃(a,v) ∈ X: a(x) = *}

d) n(X) ∩ n(-X) = ∅

e) n(X) ∩ m(-X) = ∅

f) m(X) ∩ n(-X) = ∅

g) m(X) ∩ m(-X) = d(X)

h) d(X) = d(-X)

i) n(X) ∪ d(X) ∪ n(-X) = D

Property 2. Let X, Y be itemsets in D and X ⊂ Y. Then:

a) n(X) ⊇ n(Y)

b) m(X) ⊇ m(Y)

c) m(X) ⊇ n(Y)

d) m(X) ⊇ d(Y)

e) m(X) \ n(Y) ⊇ d(Y)

f) d(X) ∩ n(Y) = ∅

Property 3. Let X, Y be itemsets in D. Then:

a) m(X∪Y) = m(X) ∩ m(Y)

b) n(X∪Y) = n(X) ∩ n(Y)

c) d(X∪Y) = [n(X) ∩ d(Y)] ∪ [d(X) ∩ n(Y)] ∪ [d(X) ∩ d(Y)]

3.2 Estimations of Support and Confidence 

In the case of an incomplete information system one may not be able to compute 
support of an itemset X precisely. Nevertheless, one can estimate the actual support by 
the pair of two numbers meaning the lowest possible support (pessimistic case) and 
the greatest possible support (optimistic case). We will denote these two numbers by 
pSup(X) and oSup(X), respectively. Clearly, 

pSup(X) = |n(X)| / |D| and oSup(X) = |m(X)| / |D|.

The difference oSup(X) - pSup(X) will be denoted by dSup(X). 

Property 4. Let X, Y be itemsets in D and X ⊂ Y. Then:

a) pSup(X) ≥ pSup(Y)

b) oSup(X) ≥ oSup(Y)

Proof: Follows immediately from Property 2. □ 

A similar problem arises when one tries to compute the confidence of a rule in an
incomplete database. Let pConf(X⇒Y) and oConf(X⇒Y) denote the lowest possible
confidence and the greatest possible confidence of X ⇒ Y, respectively. The property
below shows how to compute these values.

Property 5. Let X ⇒ Y be a rule in an incomplete database D.

a) pConf(X⇒Y) = |n(X)∩n(Y)| / (|n(X)∩n(Y)| + |m(X)∩m(-Y)|)

b) oConf(X⇒Y) = |m(X)∩m(Y)| / (|m(X)∩m(Y)| + |n(X)∩n(-Y)|)

Proof: Let us assume that there are k tuples which match both X and Y in reality, but
not necessarily in D (i.e. these tuples belong to d(X∪Y)). Let l be the number of tuples
that in reality match X and do not match Y, but the contents of D are not sufficient to
extract this information with certainty. Then, in the real world, the confidence of the
rule X ⇒ Y is equal to (|n(X)∩n(Y)| + k) / (|n(X)∩n(Y)| + k + |n(X)∩n(-Y)| + l).
However, if we have only the information from D, we should treat the confidence of
X ⇒ Y as a function f(k,l) = (|n(X)∩n(Y)| + k) / (|n(X)∩n(Y)| + k + |n(X)∩n(-Y)| + l),
where k ∈ [0, |m(X)∩m(Y)| - |n(X)∩n(Y)|] and l ∈ [0, |m(X)∩m(-Y)| - |n(X)∩n(-Y)|].

Ad. a) One can easily notice that the function f(k,l) attains its infimum for the minimal value of
k and the maximal value of l. Therefore, pConf(X⇒Y) = f(0, |m(X)∩m(-Y)| - |n(X)∩n(-Y)|)
= |n(X)∩n(Y)| / (|n(X)∩n(Y)| + |m(X)∩m(-Y)|).

Ad. b) It can be easily noticed that the function f(k,l) attains its supremum for the maximal
value of k and the minimal value of l. Therefore, oConf(X⇒Y) = f(|m(X)∩m(Y)| -
|n(X)∩n(Y)|, 0) = |m(X)∩m(Y)| / (|m(X)∩m(Y)| + |n(X)∩n(-Y)|). □

Estimations of how exemplary patterns are supported in the database from Fig. 3 
are given in Fig. 5, while estimations of confidence of exemplary association rules are 
placed in Fig. 6. 
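The two formulas of Property 5 translate directly into set operations. The sketch below applies them to the database of Fig. 3; the names `p_conf` and `o_conf` are illustrative, and the functions assume that the denominators are non-zero (they would raise `ZeroDivisionError` otherwise).

```python
from fractions import Fraction

MISSING = "*"
ATTRS = ("X1", "X2", "X3", "X4")
# Fig. 3: tuple id -> values of X1..X4, "*" marks a missing value
ROWS = {1: "*aac", 2: "aab*", 3: "abcc", 4: "abdc",
        5: "*bed", 6: "bbf*", 7: "bcg*", 8: "bch*"}
D = {tid: dict(zip(ATTRS, values)) for tid, values in ROWS.items()}

def n(X):
    return {t for t, row in D.items() if all(row[a] == v for a, v in X)}

def m(X):
    return {t for t, row in D.items()
            if all(row[a] in (v, MISSING) for a, v in X)}

def p_conf(X, Y):
    """Pessimistic confidence, Property 5(a); m(-Y) = D \\ n(Y)."""
    both = len(n(X) & n(Y))
    return Fraction(both, both + len(m(X) & (set(D) - n(Y))))

def o_conf(X, Y):
    """Optimistic confidence, Property 5(b); n(-Y) = D \\ m(Y)."""
    both = len(m(X) & m(Y))
    return Fraction(both, both + len(n(X) & (set(D) - m(Y))))

X, Y = {("X4", "c")}, {("X1", "a")}
print(p_conf(X, Y), o_conf(X, Y))  # prints: 1/3 1
```

For the rules {(X4,c)} ⇒ {(X1,a)}, {(X1,a),(X2,b)} ⇒ {(X4,c)} and {(X2,b),(X4,c)} ⇒ {(X1,a)} this reproduces the pessimistic and optimistic values of Fig. 6. For {(X1,a)} ⇒ {(X4,c)} the sketch returns a pessimistic confidence of 2/4, because tuples 2 and 5 both belong to m(X) ∩ m(-Y), whereas Fig. 6 lists 2/3; this may indicate a slip in the figure or in this reading of the formula.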

3.3 Expected Values of Support and Confidence 

Although we know already how to estimate support and confidence of rules it may be 
confusing how to treat rules for which the difference between optimistic and 
pessimistic estimations is high. It would be useful to be able to predict values of 
support and confidence close to real (though unknown) ones. Therefore, we provide 
definitions of expected support and confidence of an association rule. 

The expected value of support for a pattern (or anti-pattern) X will be denoted by
eSup(X). The definition of expected support that we propose is based on the
following assumptions:

eSup(X) / eSup(-X) = pSup(X) / pSup(-X) and eSup(X) + eSup(-X) = 1.

According to the above constraints we obtain the following definition of expected
support for an itemset X:

eSup(X) = pSup(X) / (pSup(X) + pSup(-X)).

Property 6. Let X be an itemset in D. Then, eSup(X) = pSup(X) / (1 - dSup(X)).

Proof: Follows immediately from the definition of expected support and Prop. 1(i).

□

Property 7. Let X be an itemset in D. Then, pSup(X) ≤ eSup(X) ≤ oSup(X).

Proof:

pSup(X) / eSup(X) = 1 - dSup(X) ≤ 1. So, pSup(X) ≤ eSup(X).

oSup(X) - eSup(X) = [pSup(X) + dSup(X)] - [pSup(X) / (1 - dSup(X))] =
= dSup(X) * [-pSup(X) + 1 - dSup(X)] / [1 - dSup(X)] ≥ 0. So, oSup(X) ≥ eSup(X).
□ 
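As a small numerical illustration of Properties 6 and 7, the sketch below computes the expected support from a pair (pSup, oSup). The function name `e_sup` is an illustrative choice; the input values are pessimistic and optimistic supports that follow from the sets in Fig. 4 (|n(X)| / 8 and |m(X)| / 8).

```python
from fractions import Fraction

def e_sup(p_sup, o_sup):
    """Expected support by Property 6: pSup / (1 - dSup).

    Assumes dSup = oSup - pSup < 1; otherwise the value is undefined.
    """
    d_sup = o_sup - p_sup
    return p_sup / (1 - d_sup)

# (pSup, oSup) for {(X1,a)}, {(X4,c)} and {(X1,a),(X2,b)} in Fig. 3
cases = [(Fraction(3, 8), Fraction(5, 8)),
         (Fraction(3, 8), Fraction(7, 8)),
         (Fraction(2, 8), Fraction(3, 8))]
for p, o in cases:
    e = e_sup(p, o)
    # Same value as the defining formula pSup / (pSup + pSup(-X)),
    # where pSup(-X) = 1 - oSup.
    assert e == p / (p + (1 - o))
    assert p <= e <= o          # Property 7
    print(p, o, e)
```

The three expected supports come out as 1/2, 3/4 and 2/7, which agree with the values 3/6, 3/4 and 2/7 listed for these itemsets in Fig. 5.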
Property 8. Let X, Y be itemsets in D and X ⊂ Y. Then: eSup(X) ≥ eSup(Y).

Proof: First, let us observe that the cardinality of d(Y) is maximal if d(Y) =
m(X) \ n(Y) (see Property 2(e)). On the other hand, m(X) \ n(Y) = (n(X) ∪ d(X)) \ n(Y) =
(n(X) \ n(Y)) ∪ (d(X) \ n(Y)) and (d(X) \ n(Y)) = d(X) (by Property 2(f)). So, d(Y) is
maximal for d(Y) = (n(X) \ n(Y)) ∪ d(X). By Property 6 we have:

eSup(X) = (|n(X)| / |D|) / (1 - dSup(X)) and eSup(Y) = (|n(Y)| / |D|) / (1 - dSup(Y)).

Let α = |n(Y)| / |D|, β = |n(X) \ n(Y)| / |D| and χ = (1 - dSup(X)) * (1 - (β +
dSup(X))). Then: eSup(X) - eSup(Y) = [(α+β) / (1 - dSup(X)) - α / (1 - dSup(Y))] ≥
[(α+β) / (1 - dSup(X)) - α / (1 - (β + dSup(X)))] =

[(α+β) * (1 - β - dSup(X)) - α * (1 - dSup(X))] / χ =

[-αβ + β - β² - β * dSup(X)] / χ = β * [1 - (α + β + dSup(X))] / χ =

β * [1 - |m(X)| / |D|] / χ ≥ 0. Hence, eSup(X) ≥ eSup(Y). □

The expected confidence of a rule X ⇒ Y will be denoted by eConf(X⇒Y) and we
define it in the usual way: eConf(X⇒Y) = eSup(X∪Y) / eSup(X).

Property 9. Let X ⇒ Y be a rule in an incomplete database D. Then:
pConf(X⇒Y) ≤ eConf(X⇒Y) ≤ oConf(X⇒Y)

Proof: Property 5 allows us to conclude that for any support of X ⇒ Y which belongs
to [pSup(X∪Y), oSup(X∪Y)], and for any support of the antecedent of the rule X ⇒ Y
which belongs to [pSup(X), oSup(X)], the confidence of X ⇒ Y belongs to
[pConf(X⇒Y), oConf(X⇒Y)]. By Property 7, both eSup(X∪Y) ∈ [pSup(X∪Y),
oSup(X∪Y)] and eSup(X) ∈ [pSup(X), oSup(X)]. Hence, eConf(X⇒Y) ∈
[pConf(X⇒Y), oConf(X⇒Y)]. □

Example 3. Let us consider the database from Fig. 3. In Fig. 5, we place pessimistic,
optimistic and expected supports for exemplary itemsets. The values can be quickly
computed from the information on D that is contained in Fig. 4.



Itemset X                  pSup(X)   oSup(X)   dSup(X)   pSup(-X)   eSup(X)
{(X1,a)}                   3/8       5/8       2/8       3/8        3/6
{(X4,c)}                   3/8       7/8       4/8       1/8        3/4
{(X1,a),(X4,c)}            2/8       4/8       2/8       4/8        2/6
{(X1,a),(X2,b)}            2/8       3/8       1/8       5/8        2/7
{(X2,b),(X4,c)}            2/8       3/8       1/8       5/8        2/7
{(X1,a),(X2,b),(X4,c)}     2/8       2/8       0         6/8        2/8



Fig. 5. Estimated and expected values of support of exemplary itemsets 



Fig. 6 is filled with pessimistic, optimistic and expected values of confidence of 
rules from Example 1. One may use information in Figures 4-5 to derive respective 
supports of the rules and their antecedents. 



X ⇒ Y                            pConf(X ⇒ Y)   oConf(X ⇒ Y)   eConf(X ⇒ Y)
{(X1,a)} ⇒ {(X4,c)}              2/3            4/4            2/3
{(X4,c)} ⇒ {(X1,a)}              2/6            4/4            4/9
{(X1,a),(X2,b)} ⇒ {(X4,c)}       2/3            2/2            7/8
{(X2,b),(X4,c)} ⇒ {(X1,a)}       2/3            2/2            7/8



Fig. 6. Estimated and expected values of confidence of rules 



□ 
4 Another Approach Based on Valid Databases

In this section we briefly recall and analyse another approach to missing values
presented recently in [12]. Let us start with the notation applied in this approach:

A tuple is disabled for X in D if it contains a missing value for at least one item in
X. Dis(X) denotes the subset of D disabled for X, i.e. Dis(X) = {x ∈ D | ∃(a,v) ∈ X: a(x) = *}.
The valid database (vdb) for an itemset X is defined as follows: vdb(X) = D \ Dis(X).

Support (vSup) of an itemset X is computed in vdb(X) and is defined in the
following way: vSup(X) = |n(X)| / |vdb(X)| = |n(X)| / (|D| - |Dis(X)|).

Confidence (vConf) of a rule X ⇒ Y is computed in vdb(X∪Y) and is defined as
follows: vConf(X⇒Y) = |n(X∪Y)| / (|n(X)| - |n(X) ∩ Dis(Y)|).

Property 10. Let X, Y be itemsets in D and X ⊂ Y. Then:

a) Dis(X) ⊆ Dis(Y)

b) d(X) ⊆ Dis(X)

Proof: Ad. 10(b). Follows immediately from Prop. 1(c) and the definition of Dis(X).

□ 
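The valid-database measures can be sketched in the same style as the estimates of Section 3. The code below uses the database of Fig. 3; the names `dis`, `v_sup` and `v_conf` are illustrative and the functions assume non-empty valid databases (non-zero denominators).

```python
from fractions import Fraction

MISSING = "*"
ATTRS = ("X1", "X2", "X3", "X4")
# Fig. 3: tuple id -> values of X1..X4, "*" marks a missing value
ROWS = {1: "*aac", 2: "aab*", 3: "abcc", 4: "abdc",
        5: "*bed", 6: "bbf*", 7: "bcg*", 8: "bch*"}
D = {tid: dict(zip(ATTRS, values)) for tid, values in ROWS.items()}

def n(X):
    return {t for t, row in D.items() if all(row[a] == v for a, v in X)}

def m(X):
    return {t for t, row in D.items()
            if all(row[a] in (v, MISSING) for a, v in X)}

def dis(X):
    """Tuples with a missing value on at least one attribute of X."""
    return {t for t, row in D.items()
            if any(row[a] == MISSING for a, _ in X)}

def v_sup(X):
    return Fraction(len(n(X)), len(D) - len(dis(X)))

def v_conf(X, Y):
    return Fraction(len(n(X | Y)), len(n(X)) - len(n(X) & dis(Y)))

X = {("X1", "a"), ("X4", "c")}
p_sup, o_sup = Fraction(len(n(X)), len(D)), Fraction(len(m(X)), len(D))
print(v_sup(X), p_sup, o_sup)  # prints: 1 1/4 1/2
```

For this itemset the valid-database support is 1, which lies outside the interval [pSup(X), oSup(X)] = [2/8, 4/8] and exceeds the valid-database support 3/6 of its subset {(X1,a)}.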

Example 3. Let us consider the incomplete database D presented in Fig. 3. In Fig. 7,
we place the disabled tuples and the support computed in valid databases for exemplary
itemsets. For comparison purposes, the estimated and expected supports of
these itemsets are added.



Itemset X                  Dis(X)           vSup(X)   pSup(X)   oSup(X)   eSup(X)   d(X)
{(X1,a)}                   {1,5}            3/6       3/8       5/8       3/6       {1,5}
{(X4,c)}                   {2,6,7,8}        3/4       3/8       7/8       3/4       {2,6,7,8}
{(X1,a),(X4,c)}            {1,2,5,6,7,8}    2/2       2/8       4/8       2/6       {1,2}
{(X1,a),(X2,b)}            {1,5}            2/6       2/8       3/8       2/7       {5}
{(X2,b),(X4,c)}            {2,6,7,8}        2/4       2/8       3/8       2/7       {6}
{(X1,a),(X2,b),(X4,c)}     {1,2,5,6,7,8}    2/2       2/8       2/8       2/8       ∅



Fig. 7. Estimated and expected values of support of exemplary itemsets 



Let us compare the differently computed supports for the itemset X = {(X1,a),(X4,c)}.
Let us note that vSup(X) does not belong to [pSup(X), oSup(X)], which means that
there is no substitution for missing values in the database that would produce such a
support for X. Additionally, we may notice that vSup(X), which is equal to 1, is higher
than the supports of the proper subsets of X, which are equal to 3/6 for {(X1,a)} and 3/4
for {(X4,c)}, respectively!

Let us also observe how quickly the difference between Dis(X) and d(X) increases
when items are added to an itemset X.

In Fig. 8 we put confidence of rules computed in valid databases. Additionally, we 
placed estimated and expected values of confidence of the rules. 



X ⇒ Y                            vConf(X ⇒ Y)   pConf(X ⇒ Y)   oConf(X ⇒ Y)   eConf(X ⇒ Y)
{(X1,a)} ⇒ {(X4,c)}              1              2/3            1              2/3
{(X4,c)} ⇒ {(X1,a)}              1              1/3            1              4/9
{(X1,a),(X2,b)} ⇒ {(X4,c)}       1              2/3            1              7/8
{(X2,b),(X4,c)} ⇒ {(X1,a)}       1              2/3            1              7/8



Fig. 8. Exemplary association rules for the database from Fig. 3 



□ 
Let us conclude Example 3 with the following property:

Property 11. Let X, Y be itemsets in D. The following statements do not hold in general:

a) vSup(X) ∈ [pSup(X), oSup(X)]

b) If X ⊂ Y then vSup(X) ≥ vSup(Y)

□

Property 12. Let X ⇒ Y be a rule in an incomplete database D. Then:

pConf(X⇒Y) ≤ vConf(X⇒Y)

Proof: In the proof we will apply the following equivalences:

• |n(X)| = |n(X)∩n(Y)| + |n(X)∩Dis(Y)| + |n(X)∩[D \ (Dis(Y)∪n(Y))]|,

• n(X∪Y) = n(X)∩n(Y),

• |m(-Y)| + |n(Y)| = |D|.

Now, we may rewrite vConf and pConf as follows:
vConf(X⇒Y) = |n(X)∩n(Y)| / (|n(X)∩n(Y)| + |n(X)∩[D \ (Dis(Y)∪n(Y))]|),
pConf(X⇒Y) = |n(X)∩n(Y)| / (|n(X)∩n(Y)| + |m(X)∩[D \ n(Y)]|).
Clearly, [D \ (Dis(Y)∪n(Y))] ⊆ [D \ n(Y)]. Additionally,
n(X) ⊆ m(X). Hence, vConf(X⇒Y) ≥ pConf(X⇒Y).

□ 

Property 13. Let X ⇒ Y be a rule in an incomplete database D. The following
statement does not hold in general: vConf(X⇒Y) ≤ oConf(X⇒Y)

Proof: We will prove Property 13 by showing an example of a database D and an
association rule X ⇒ Y such that vConf(X⇒Y) > oConf(X⇒Y) in D. Let D be the
database in Fig. 9. Let X = {(X1,a)} and Y = {(X2,a),(X3,a)}. Then, vConf(X⇒Y) =
1, whereas oConf(X⇒Y) = 1/2.



Id   X1   X2   X3
1    a    a    a
2    a    b    *
3    b    b    b



Fig. 9. Exemplary database with missing values 



□ 
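The counterexample of Property 13 is small enough to be checked mechanically. The sketch below recomputes both confidences for the database of Fig. 9; the function names are illustrative and follow the definitions of Dis, vConf and oConf given earlier.

```python
from fractions import Fraction

MISSING = "*"
ATTRS = ("X1", "X2", "X3")
# Fig. 9: tuple id -> values of X1..X3, "*" marks a missing value
ROWS = {1: "aaa", 2: "ab*", 3: "bbb"}
D = {tid: dict(zip(ATTRS, values)) for tid, values in ROWS.items()}

def n(X):
    return {t for t, row in D.items() if all(row[a] == v for a, v in X)}

def m(X):
    return {t for t, row in D.items()
            if all(row[a] in (v, MISSING) for a, v in X)}

def dis(X):
    return {t for t, row in D.items()
            if any(row[a] == MISSING for a, _ in X)}

def v_conf(X, Y):
    return Fraction(len(n(X | Y)), len(n(X)) - len(n(X) & dis(Y)))

def o_conf(X, Y):
    both = len(m(X) & m(Y))
    # n(-Y) = D \ m(Y)
    return Fraction(both, both + len(n(X) & (set(D) - m(Y))))

X = {("X1", "a")}
Y = {("X2", "a"), ("X3", "a")}
print(v_conf(X, Y), o_conf(X, Y))  # prints: 1 1/2
```

Tuple 2 is removed from the valid database because X3 is missing, although its known value X2 = b already shows that it cannot match Y; this is why vConf overshoots the optimistic estimate.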



5 Conclusions 

Support and confidence of association rules induced from incomplete databases may
not be computable exactly, but they can be bounded precisely. We showed in the paper how
to estimate them. The real support and confidence of a rule always lie within the estimated
bounds. Additionally, we offered definitions of expected values of support and
confidence of a rule. The pessimistic, optimistic, and expected supports of
itemsets never increase when items are added. This is a required and important property of
itemsets that speeds up the rule mining process considerably. In the process of looking for
association rules, a user may decide whether constraints on support and confidence
should be imposed on estimated or expected values.

We also analysed another approach to missing values that was presented in [12].
We proved that applying this approach does not guarantee that 1) the support of an
itemset is not greater than the support of its subset, 2) the support of an itemset
belongs to the estimated interval, 3) the confidence of an association rule is not greater than
the optimistic estimation.
Abstract. Data mining is becoming increasingly important as the
size of databases grows ever larger and the need to explore hidden rules
in the databases becomes widely recognized. Currently, database systems
are dominated by relational databases, and the ability to perform
data mining using standard SQL queries will definitely ease the implementation
of data mining. However, the performance of SQL-based data mining
is known to fall behind specialized implementations. In this paper we
present an evaluation of parallel SQL-based data mining on a large-scale
PC cluster. The performance achieved by parallelizing the SQL query for
mining association rules using 4 processing nodes is on par with that of a C-based
program.

Keywords: data mining, parallel SQL, query optimization, PC cluster 



1 Introduction 

Extracting valuable rules from a set of data has attracted a lot of attention from
both the research and business communities. This is particularly driven by the explosion
of the amount of information stored in databases in recent years.

One method of data mining is finding association rules, i.e. rules which
imply a certain association relationship such as "occur together" or "one implies
the other" among a set of objects [1]. This kind of mining is known to be a CPU-intensive
application. This fact has driven much of the initial research in
data mining to develop new efficient mining methods [1][2]. However, these methods imply
specialized systems separated from the database.

Therefore, there have recently been some efforts to perform data mining using
relational database systems, which offer advantages such as seamless integration
with existing systems and high portability. The methods examined range from
directly using SQL to extensions such as user-defined functions (UDF) [4].
Unfortunately, the SQL approach is reported to have a performance drawback.

* Currently at Information & Communication System Development Center, Mitsubishi 
Electric, Ohfuna 5-1-1 Kamakura-shi Kanagawa-ken 247-8501, Japan 
However, the trend of increasing computing power continues, and more can be
expected from utilizing parallel processing architectures. More importantly, most major
commercial database systems have included capabilities to support parallelization,
although no report is available on how parallelization affects the
performance of the complex queries required by data mining. This fact motivated us to
examine how efficiently SQL-based association rule mining can be parallelized
using our shared-nothing large-scale PC cluster pilot system. We have also
compared it with a C program based on the well-known Apriori algorithm [2].



2 Association Rule Mining Based on SQL 



An example of association rule mining is finding a rule such as "if a customer buys A and B,
then in 90% of the cases the customer also buys C" in the transaction databases of large retail
organizations. This 90% value is called the confidence of the rule. Another important
parameter is the support of an itemset, such as {A,B,C}, which is defined as the percentage
of all transactions that contain the itemset. For the above example, confidence
can also be measured as support({A,B,C}) divided by support({A,B}). 
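The relation between support and confidence can be illustrated on a toy transaction list. The data below is hypothetical and only serves as an example; `support` and `confidence` are illustrative names.

```python
from fractions import Fraction

# Hypothetical transactions, each a set of purchased items
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(itemset <= t for t in transactions)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent):
    """support(antecedent | consequent) divided by support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"A", "B", "C"}), support({"A", "B"}))  # prints: 2/5 3/5
print(confidence({"A", "B"}, {"C"}))                  # prints: 2/3
```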

A common strategy to mine association rules is:

1. Find all itemsets that have transaction support above the minimum support,
usually called large itemsets.

2. Generate the desired rules using the large itemsets.

Since the first step consumes most of the processing time, the development of mining
algorithms has concentrated on this step.

In our experiment we employed an ordinary standard SQL query that is similar
to the SETM algorithm [3]. It is shown in figure 1.



CREATE TABLE SALES (id int, item int);

-- PASS 1
CREATE TABLE C_1 (item_1 int, cnt int);
CREATE TABLE R_1 (id int, item_1 int);

INSERT INTO C_1
SELECT item AS item_1, COUNT(*)
FROM SALES
GROUP BY item
HAVING COUNT(*) >= :min_support;

INSERT INTO R_1
SELECT p.id, p.item AS item_1
FROM SALES p, C_1 c
WHERE p.item = c.item_1;

-- PASS k
CREATE TABLE RTMP_k (id int, item_1 int,
    item_2 int, ... , item_k int);
CREATE TABLE C_k (item_1 int,
    item_2 int, ... , item_k int, cnt int);
CREATE TABLE R_k (id int, item_1 int,
    item_2 int, ... , item_k int);

INSERT INTO RTMP_k
SELECT p.id, p.item_1, p.item_2, ... ,
       p.item_k-1, q.item_k-1
FROM R_k-1 p, R_k-1 q
WHERE p.id = q.id
AND p.item_1 = q.item_1
AND p.item_2 = q.item_2
...
AND p.item_k-2 = q.item_k-2
AND p.item_k-1 < q.item_k-1;

INSERT INTO C_k
SELECT item_1, item_2, ..., item_k, COUNT(*)
FROM RTMP_k
GROUP BY item_1, item_2, ..., item_k
HAVING COUNT(*) >= :min_support;

INSERT INTO R_k
SELECT p.id, p.item_1, p.item_2, ... , p.item_k
FROM RTMP_k p, C_k c
WHERE p.item_1 = c.item_1
AND p.item_2 = c.item_2
...
AND p.item_k = c.item_k;

DROP TABLE R_k-1;
DROP TABLE RTMP_k;



Fig. 1. SQL query to mine association rule 
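To make the query template concrete, the following sketch instantiates pass 1 and pass 2 (k = 2) and runs them on an in-memory SQLite database through Python's standard `sqlite3` module. This is only an illustration on a single node with hypothetical sample data; it is not the parallel database server used in the experiments, and the function name `mine_pairs` is an assumption of this sketch. As in the figure, `min_support` is an absolute count.

```python
import sqlite3

def mine_pairs(sales, min_support):
    """Run pass 1 and pass 2 of the SETM-style query on (id, item) rows."""
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    cur.execute("CREATE TABLE SALES (id INT, item INT)")
    cur.executemany("INSERT INTO SALES VALUES (?, ?)", sales)
    params = {"min_support": min_support}

    # PASS 1: large 1-itemsets and the transaction data restricted to them
    cur.execute("CREATE TABLE C_1 (item_1 INT, cnt INT)")
    cur.execute("CREATE TABLE R_1 (id INT, item_1 INT)")
    cur.execute("""INSERT INTO C_1
                   SELECT item AS item_1, COUNT(*) FROM SALES
                   GROUP BY item HAVING COUNT(*) >= :min_support""", params)
    cur.execute("""INSERT INTO R_1
                   SELECT p.id, p.item AS item_1 FROM SALES p, C_1 c
                   WHERE p.item = c.item_1""")

    # PASS 2: candidate pairs by self-join of R_1, then counting
    cur.execute("CREATE TABLE RTMP_2 (id INT, item_1 INT, item_2 INT)")
    cur.execute("CREATE TABLE C_2 (item_1 INT, item_2 INT, cnt INT)")
    cur.execute("""INSERT INTO RTMP_2
                   SELECT p.id, p.item_1, q.item_1 FROM R_1 p, R_1 q
                   WHERE p.id = q.id AND p.item_1 < q.item_1""")
    cur.execute("""INSERT INTO C_2
                   SELECT item_1, item_2, COUNT(*) FROM RTMP_2
                   GROUP BY item_1, item_2
                   HAVING COUNT(*) >= :min_support""", params)

    c1 = cur.execute("SELECT item_1, cnt FROM C_1 ORDER BY item_1").fetchall()
    c2 = cur.execute("SELECT item_1, item_2, cnt FROM C_2 "
                     "ORDER BY item_1, item_2").fetchall()
    con.close()
    return c1, c2

# Hypothetical data: four transactions over items 1..4
sales = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2),
         (3, 1), (3, 3), (4, 2), (4, 4)]
print(mine_pairs(sales, min_support=2))
# prints: ([(1, 3), (2, 3), (3, 2)], [(1, 2, 2), (1, 3, 2)])
```

The sketch assumes that each (id, item) pair occurs at most once in SALES; later passes would follow the same pattern with longer column lists.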
Transaction data is normalized into the first normal form (transaction ID,
item). In the first pass we simply gather the count of each item. Items that
satisfy the minimum support are inserted into the large itemsets table C_1, which takes
the form (item, item count). Then the transaction data that match large itemsets are stored
in R_1.

In other passes, for example pass k, we first generate all lexicographically
ordered candidate itemsets of length k into table RTMP_k by joining the transaction
data of length k-1. Then we generate the counts for those itemsets that meet the minimum
support and include them in the large itemset table C_k. Finally, the transaction
data R_k of length k is generated by matching items in the candidate itemset
table RTMP_k with items in the large itemsets.

The original SETM algorithm assumes execution using sort-merge joins. Inside
the database server on our system, relational joins are executed using hash joins
and tables are partitioned over nodes by hashing. As a result, parallelization
efficiency is much improved. This approach is very effective for large-scale data
mining.

In the execution plan for this query, the join to generate candidate itemsets is
executed at each node independently. Then, after counting itemsets locally,
nodes exchange the local counts by applying a hash function on the itemsets in
order to determine the overall support count of each itemset. Another data exchange
occurs when the large itemsets are distributed to every node. Finally, each node
independently executes the join again to create the transaction data.
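The exchange of local counts described above can be imitated in a single process. The sketch below is a simplified simulation under several assumptions: nodes are plain Python lists, transactions are assigned to nodes by transaction ID modulo the number of nodes, only candidate pairs are counted, and the pass-1 filtering is omitted. The name `parallel_pair_counts` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def parallel_pair_counts(transactions, num_nodes, min_support):
    """Simulate local counting followed by a hash-based count exchange."""
    # Partition the transactions over the nodes by transaction ID.
    partitions = [[] for _ in range(num_nodes)]
    for tid, items in transactions.items():
        partitions[tid % num_nodes].append(items)

    # Each node generates and counts its candidate pairs independently.
    local_counts = [
        Counter(pair for items in part
                for pair in combinations(sorted(items), 2))
        for part in partitions
    ]

    # Exchange: the node that owns an itemset is chosen by hashing the
    # itemset, so all local counts of one itemset meet at the same node.
    global_counts = [Counter() for _ in range(num_nodes)]
    for counts in local_counts:
        for itemset, cnt in counts.items():
            global_counts[hash(itemset) % num_nodes][itemset] += cnt

    # Each owner keeps its large itemsets; merging the results stands in
    # for distributing the large itemsets to every node.
    large = {}
    for counts in global_counts:
        large.update({s: c for s, c in counts.items() if c >= min_support})
    return large

transactions = {1: {1, 2, 3}, 2: {1, 2}, 3: {1, 3}, 4: {2, 4}}
print(parallel_pair_counts(transactions, num_nodes=3, min_support=2))
```

The result does not depend on the number of nodes: with the sample data, the pairs (1, 2) and (1, 3) each reach a global count of 2 for any `num_nodes`.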

3 Performance Evaluation 

3.1 Parallel Execution Environment 

The experiment is conducted on a PC cluster developed at the Institute of Industrial
Science, The University of Tokyo. This pilot system, named NEDO-100, consists of one hundred
commodity PCs connected by an ATM network. We have also
developed the DBKernel database server for query processing on this system. Each
PC has an Intel Pentium Pro 200MHz CPU, a 4.3GB SCSI hard disk and 64 MB of
RAM.

A performance evaluation using the TPC-D benchmark on the 100-node cluster is
reported in [7]. The results showed that it can achieve significantly higher performance,
especially for join-intensive queries such as query 9, compared to the current
commercially available high-end systems.

3.2 Dataset 

We use synthetic transaction data generated with the program described in the Apriori
algorithm paper [2] for the experiment. The parameters used are: number of
transactions 200000, average transaction length 10, and number of items 2000.
Transaction data is partitioned uniformly by transaction ID among
the processing nodes' local hard disks.
3.3 Results

The execution times for several minimum support values are shown in figure 2 (left). The
result is surprisingly good even compared with a directly coded Apriori-based C
program on a single processing node. On average, we can achieve the same level
of execution time by parallelizing SQL-based mining with around 4 processing
nodes. The speedup ratio shown in figure 2 (right) is also reasonably good,
although the speedup seems to saturate as the number of processing nodes
increases. As the size of the dataset assigned to each node gets smaller,
processing overhead and synchronization cost, which depends on the number of
nodes, cancel the gain.







Fig. 2. Execution time(left) Speedup ratio(right) 



Figure 3 (left) shows the time percentage for each pass when the minimum
support is 0.5%. Eight passes are necessary to process the entire transaction database.
It is well known that in most cases the second pass generates a huge number
of candidate itemsets; thus it is the most time-consuming phase. Figure 3 (right)
shows the speedup ratio for each pass. The later the pass, the smaller the set of candidate
itemsets. Thus non-negligible parallelization overhead becomes dominant, especially
in passes later than the fifth. Depending on the size of the candidate itemsets, we
could change the degree of parallelization. That is, we should reduce the number
of nodes in later passes. Such extensions will need further investigation.

Fig. 3. Pass analysis (minimum support 5%). Contribution of each pass in execution
time (left). Speedup ratio of each pass (right)

4 Summary and Conclusion

In this paper, we have compared the parallel implementation of SQL-based
association rule mining with the directly coded C implementation. Since SQL
is popular and already has a sophisticated optimizer, we believed that a parallel
SQL system could achieve reasonably sufficient performance.

Through a real implementation, we have confirmed that parallelization using 4
processing nodes for association rule mining can beat the performance of the directly
coded C implementation. We do not have to buy or write special data mining
application code; the SQL query for association rule mining is extremely easy to
implement. It is also very flexible: we have extended the SQL query to handle
generalized association rule mining with taxonomy. The report will be available
soon.

At present, parallel SQL runs on expensive massively parallel machines,
but this will not be the case in the future. Instead, it will run on inexpensive PC cluster
systems or WS cluster systems. Thus we believe that an SQL implementation based on sophisticated
optimization would be one of the reasonable approaches.

Much further investigation remains. The current SQL query could be
optimized further. In addition, we plan to do larger experiments. Since our system
has 100 nodes, we could handle larger transaction databases. In such experiments,
data skew becomes a problem. Skew handling is one of the interesting research
issues for parallel processing. We have reported some mining algorithms that
effectively deal with this problem in [6].
Abstract. We examined the problem of applying association rules in a
heterogeneous database. Due to heterogeneity, we have to generalize the 
notion of association rule to define a new heterogeneous association rule 
(h-rule) which denotes data association between various types of data in 
different subsystems of a heterogeneous and multimedia database, such 
as music pieces vs. photo pictures, etc. Boolean association rule and 
quantitative association rule are special cases of h-rule. H-rule integrates 
previously defined rule concepts and expands association rule mining 
from single dataset mining to database mining. 



1 Introduction 

The essence of association rule mining problem is to find interesting data cor- 
relations between data items in different attributes. Boolean association rule 
(BAR) [6] is data association between pairs of data (or vector) values. Quantita- 
tive association rule [6] (QAR) and distance based association rule [7] are more 
general and represent set-oriented associations between data. However, they are 
still restricted to association between data in different attributes of a single tab- 
ular dataset. In a heterogeneous and multimedia database (HMD), the interest 
to find data associations should not be restricted to mining rules within a single 
relation. We propose a new and more general form of association rule, h-rule, to 
suit the knowledge discovery task in an HMD. 

Section 2 defines heterogeneous association rule (h-rule). Sect. 3 shows that 
h-rule is a generalization of the BAR and QAR. Finally, Sect. 4 concludes the 
paper. 



2 Heterogeneous Association Rules 



The concept of “related” entries is simple in a relation: two entries are related if 
they are in the same tuple. However, in an HMD, in general, only pointers or IDs 
can exist as (logical and/or physical) connections between entities of different 
relations or data files [4,5,8]. This leads to the following definition: 
Dfn 2.1: Related entries in an HMD Let d1 be an entry in a relation (or object
cluster) D1, d2 be an entry in another relation (or object cluster) D2. D1
and D2 can be the same, or different relations (object clusters). d1 and d2 are
related if one of the following is true:

— If both D1 and D2 are relations, then either d1 and d2 are in the same tuple
of the same relation, or d1 and d2 are in tuples t1 and t2, respectively, and t1
and t2 are joinable.

— If at least one of D1 and D2 is an object cluster, d1 and d2 are connected by a
(physical, logical, ID, etc.) pointer.

Note the subtle difference between a value and an entry. A value is a pattern, 
such as a text string, a number, an image pattern, etc. An entry is a cell that 
can hold a value, or values, such as a relational table entry. 

In the case of a single relation, the support of an association rule is simply 
the number of tuples that contain both antecedent and consequent. However, 
the notion “support” actually has two separate meanings, i.e., the data cou- 
pling strength and the usefulness of data implication. We call the first one the 
association support and the second one the implication support. Formally, we 
have: 

Dfn 2.2: Heterogeneous Association Rule (h-rule) Let A_X be a set of attributes
in a subsystem D1 and A_Y be a set of attributes in a subsystem D2. If D1 = D2,
then the condition A_X ∩ A_Y = ∅ is added.

A heterogeneous association rule (h-rule) is a data association denoted as
X ⇒ Y, where X is a value (if A_X is a set, then X is a vector), or a set of
values (vectors) in A_X and Y is a value (if A_Y is a set, then Y is a vector), or
a set of values (vectors) in A_Y.

— The association support of the rule is the number of pairs (d1, d2) in D1 × D2,
where d1 is an entry in D1 that contains a value in X and d2 is an entry
in D2 that contains a value in Y; d1 and d2 are related. We call d1 and d2
the support nodes of the h-rule.

— The implication support of the rule is the number of entries d2 in D2 such
that the pair (d1, d2) exists in D1 × D2, and d1 and d2 are related support
nodes.

— The association confidence of the rule is the ratio of the number of pairs
(d1, d2) (related support nodes) in D1 × D2 to the number of entries d (containing
a value in X) in D1.

— The implication confidence of the rule is the ratio of the number of support
nodes in D2 to the number of entries d (containing values in X) in D1.

s1, s2, c1, and c2 are related quantities as shown in Theorem 1.

Theorem 1. For an h-rule X ⇒ Y, its association support s1, implication support
s2, association confidence c1, and implication confidence c2
satisfy s1/c1 = s2/c2.
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Proof. Let n be the total of entires in Di contains values in X. Then, Ci = si/n 
and C 2 = siln. From these equalities we obtain n = si/ci and n = sijci- 
Hence s\jc\ = S 2 /c 2 - 

As an example of the supports and confidences in h-rules, suppose two relations
$R_1$ and $R_2$ are joinable and have a many-to-many relationship. Also suppose
$a_i \in R_1$, $b_j \in R_2$, $i \in \{1, 2, 3\}$, $j \in \{1, 2\}$, $(a_i, b_j) \in R_1 \bowtie R_2$
(the join of $R_1$ and $R_2$), and $|\{(a_i, b_j)\}| = 6$. If $\{a_1, a_2, a_3\} \Rightarrow \{b_1, b_2\}$
is an h-rule, then its association support is 6, its implication support is 2, its
association confidence is 6/3 = 2, and its implication confidence is 2/3 = 0.67.
Note that a confidence is no longer a measure of “probability” and may
exceed 1.
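
The worked example can be checked with a small sketch. The function name `h_rule_measures` and the representation of related entries as a set of pairs are assumptions made for illustration only; the paper does not prescribe a data structure.

```python
def h_rule_measures(related_pairs, x_values, y_values, d1_entries):
    """Return (s1, s2, c1, c2) for an h-rule X => Y.

    related_pairs: set of (d1, d2) pairs of related entries (hypothetical encoding)
    x_values, y_values: sets of values forming X and Y
    d1_entries: all entries of the first subsystem D1
    """
    # Support nodes: related pairs whose first entry is in X and second is in Y.
    support_pairs = {(d1, d2) for d1, d2 in related_pairs
                     if d1 in x_values and d2 in y_values}
    s1 = len(support_pairs)                       # association support
    s2 = len({d2 for _, d2 in support_pairs})     # implication support
    n = sum(1 for d in d1_entries if d in x_values)
    return s1, s2, s1 / n, s2 / n


# Many-to-many join in which every a_i is related to every b_j (6 pairs).
a_entries = ["a1", "a2", "a3"]
b_entries = ["b1", "b2"]
joined = {(a, b) for a in a_entries for b in b_entries}

s1, s2, c1, c2 = h_rule_measures(joined, set(a_entries), set(b_entries), a_entries)
print(s1, s2, c1, round(c2, 2))  # 6 2 2.0 0.67
```

Both ratios $s_1 / c_1$ and $s_2 / c_2$ equal $n = 3$ here, as Theorem 1 requires.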

3 Integration of Association Rule Concepts 

H-rules in a single relation are actually BARs or QARs. This theoretical integra- 
tion of various rule concepts lays the foundation for a uniform association rule 
mining operation across the entire HMD. In the following, Dfn 3.3 is adopted
from [6] and Dfn 3.1 is the definition of BAR [2], restated using a formal
description similar to that of Dfn 3.3.

Dfn 3.1: Boolean Association Rule (BAR) Let $L = \{i_1, i_2, \ldots, i_m\}$ be a set of
literals, called attributes. Let $P$ denote a set of values (text strings, numbers,
etc.). For any $A \subseteq L \times P$, $attribute(A)$ denotes the set $\{x \mid \langle x, v \rangle \in A\}$. If the
restriction is added that each attribute occurs at most once in $A$, then $A$ is called
a record. Let $D$ be a set of records.

A record $R$ is said to support $X$ ($\subseteq L \times P$) if $attribute(R \cap X) = attribute(X)$.
A BAR $X \Rightarrow Y$ in $D$, where $X, Y \subseteq L \times P$ and $attribute(X) \cap attribute(Y) = \emptyset$,
is an implication from $X$ to $Y$ in set $D$. The rule has support $s$ if $s$ records in $D$
support $X \cup Y$. The rule has confidence $c\%$ if $c\%$ of the records that support $X$
also support $Y$.

Dfn 3.2: Tuple Pattern Let $R$ be a relational schema, $r$ a relation under the
schema $R$, and $A_X \subseteq R$. $T$ is called a tuple pattern in $r[A_X]$ if, when $r$ is
projected to $A_X$ and all duplicates are eliminated, $r[A_X]$ has a unique tuple $T$.

Theorem 2. Let $R$ be a relational schema and $r$ a relation under the schema $R$,
with $A_X, A_Y \subseteq R$ and $A_X \cap A_Y = \emptyset$. Let $X$ be a tuple pattern in $r[A_X]$ and $Y$
a tuple pattern in $r[A_Y]$.

An h-rule $X \Rightarrow Y$ in $r$ is the same as a BAR $X \Rightarrow Y$ [6]. Also, the association
support $s_1$ and the implication support $s_2$ of the h-rule are equal to the support
of the BAR; the association confidence $c_1$ and the implication confidence $c_2$ of the
h-rule are the same as the confidence of the BAR.

Proof. Two entries being related in $r$ means that they are in the same tuple. By
Dfn 2.2, the association support of the h-rule is then equal to its implication
support. By Theorem 1, the association confidence of the h-rule is equal
to its implication confidence.

Referring to Dfn 3.1, we can take $R = L$, $r = D$, $A_X = attribute(X)$, and
$A_Y = attribute(Y)$; a tuple in $r$ is a record $R$ in $D$, a value $X$ in the antecedent
of the h-rule is the antecedent value $X \in L \times P$ in the BAR, and a value $Y$ in the
consequent of the h-rule is the consequent value $Y \in L \times P$ in the BAR. So, in this
case an h-rule is a BAR.

Dfn 3.3: Quantitative Association Rule (QAR) Let $L = \{i_1, i_2, \ldots, i_m\}$ be a set of
literals, called attributes. Let $P$ denote the set of positive integers. Let $L_V$ denote
the set $L \times P$.

Let $L_R$ denote the set $\{\langle x, l, u \rangle \in L \times P \times P \mid l \le u$ if $x$ is quantitative;
$l = u$ if $x$ is categorical$\}$. For any $X \subseteq L_R$, $attribute(X)$ denotes the set
$\{x \mid \langle x, l, u \rangle \in X\}$.

Let $D$ be a set of records $R$, where each $R \subseteq L_V$. Each attribute occurs at
most once in a record.

A record $R$ is said to support $X$ ($\subseteq L_R$) if $\forall \langle x, l, u \rangle \in X$ ($\exists \langle x, q \rangle \in R$
such that $l \le q \le u$).

A QAR $X \Rightarrow Y$, where $X, Y \subseteq L_R$ and $attribute(X) \cap attribute(Y) = \emptyset$, is an
implication from $X$ to $Y$ in set $D$. The rule has support $s$ if $s$ records in $D$
support $X \cup Y$. The rule has confidence $c\%$ if $c\%$ of the records that support $X$
also support $Y$.

The intuition of this definition is that for quantitative data, the antecedent $X$
and the consequent $Y$ in a QAR are high dimensional “cubes”. The meaning of
“a record $R$ supports $X$” is that for each interval $\langle x, l, u \rangle$ in $X$, $R$ has at least
one $\langle x, v \rangle$ which is contained in $\langle x, l, u \rangle$. In reality, the antecedent and
the consequent of the rule do not have to be in “cubic” shape. However, when
h-rules are defined within a single relation and this “cubic” shape restriction is
added, the h-rules are precisely QARs.
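
The "cube" reading of QAR support can be illustrated with a minimal sketch. The record layout (a dict from attribute to integer value), the attribute names, and the sample data are all hypothetical; a categorical attribute is encoded as an interval with equal bounds, as in the definition.

```python
def supports(record, pattern):
    """True if the record falls inside every interval <x, l, u> of the pattern.

    record:  dict attribute -> integer value            (hypothetical layout)
    pattern: dict attribute -> (low, high), low == high for categorical x
    """
    return all(attr in record and low <= record[attr] <= high
               for attr, (low, high) in pattern.items())


def qar_support_confidence(records, x, y):
    """Return (support count, confidence) of the QAR X => Y."""
    both = {**x, **y}  # X union Y; attribute(X) and attribute(Y) are disjoint
    n_x = sum(supports(r, x) for r in records)
    n_xy = sum(supports(r, both) for r in records)
    return n_xy, (n_xy / n_x if n_x else 0.0)


# Illustrative data only; married is categorical, encoded as 0 or 1.
records = [
    {"age": 23, "married": 0, "cars": 1},
    {"age": 25, "married": 1, "cars": 1},
    {"age": 29, "married": 0, "cars": 0},
    {"age": 34, "married": 1, "cars": 2},
    {"age": 38, "married": 1, "cars": 2},
]
X = {"age": (30, 39), "married": (1, 1)}
Y = {"cars": (2, 2)}
sup, conf = qar_support_confidence(records, X, Y)
print(sup, conf)  # 2 1.0
```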

Dfn 3.4: Interval Pattern Let $S$ be a relational schema, $r$ a relation under $S$,
and $A_X \subseteq S$. Let $T$ be a tuple under $A_X$ in which each entry is an interval. $T$ is
called an interval pattern in $r[A_X]$ if, when $r$ is projected to $A_X$ and all
duplicates are eliminated, $T$ contains at least one tuple in $r[A_X]$, i.e., there is at
least one tuple $t$ in $r[A_X]$ such that each entry of $t$ is contained in the interval of $T$
at the same attribute.

Theorem 3. Let $S$ be a relational schema and $r$ a relation under $S$, with
$A_X, A_Y \subseteq S$ and $A_X \cap A_Y = \emptyset$. Let $X$ be an interval pattern in $r[A_X]$ and $Y$
an interval pattern in $r[A_Y]$.

An h-rule $X \Rightarrow Y$ in $r$ is the same as a QAR $X \Rightarrow Y$. Also, the association
support $s_1$ and the implication support $s_2$ of the h-rule are equal to the support
of the QAR; the association confidence $c_1$ and the implication confidence $c_2$ of the
h-rule are the same as the confidence of the QAR.

Proof. In the definition of the h-rule, two relational table entries being related
in $r$ means that they are in the same tuple. By Dfn 2.2, the association support of
the h-rule is then equal to its implication support. By Theorem 1, the
association confidence of the h-rule is equal to its implication confidence.

Referring to Dfn 3.3, we can take $S = L$, $r = D$, $A_X = attribute(X)$, and
$A_Y = attribute(Y)$; a tuple in $r$ is a record $R$ in $D$, a set $X$ of values in the
antecedent of the h-rule is the antecedent set $X \subseteq L \times P$ in the QAR, and a set $Y$ of
values in the consequent of the h-rule is the consequent set $Y \subseteq L \times P$ in the QAR.
So, in this case, an h-rule is a QAR.

4 Summary Remarks 

Mining association rules across an entire database, i.e., database mining, has a
much stronger impact on business operations than simple dataset
mining [1, 2, 3, 6, 7]. The h-rule generalizes association rule mining and includes BARs
and QARs as special cases. This unifies data mining and database mining into
a common theoretical framework and also allows many techniques developed in
data mining, such as the Apriori algorithm, to be applied in database mining.

Database mining is a very complex issue. By defining the h-rule, we have only
built a base for exploring this complex issue. A considerable amount of research
needs to be done on defining special forms of h-rules for particular
database structures, designing new algorithms and strategies to mine h-rules,
designing new data structures to store mined h-rules and facilitate searching them,
and investigating strategies for using h-rule mining to best benefit business
operations.

Abstract. The association rule is an important contribution to the data
mining field from the database research community, and its mining has
become a promising research topic. The traditional association rule is limited
to intra-transaction mining. Recently the concept of the multidimensional
inter-transaction association rule (MDITAR) was proposed by H.J.
Lu. Based on an analysis of the inadequacies of Lu’s definition, this paper
introduces a modified and extended definition of MDITAR, which
makes MDITAR more general, practical and reasonable.
Keywords: Multidimensional Transaction Database, Data Mining,
Multidimensional Inter-transaction Association Rule (MDITAR), Definition.



1 Introduction 

Mining association rules in large databases is one of the top problems in the
data mining field and has attracted great attention from the database research
community. First introduced by R. Agrawal et al. for market basket analysis
[1], association rules imply associations among different items bought by
customers. Nowadays association rule mining is no longer limited to transaction
databases; it is also applied to relational databases, spatial databases,
multimedia databases, etc. [2], [3], [4]. However, the semantics of the traditional
association rule introduced by R. Agrawal has not changed, i.e., association rules imply
associations among attribute items within the same data record. Recently H.J. Lu
et al. proposed the multidimensional inter-transaction association rule (MDITAR)
while mining stock transaction data [5]. In comparison with traditional
association rules, MDITAR differs in two aspects:

1. It implies associations among items within different transaction records;

2. It deals with data records having dimensional attributes.

Obviously, MDITAR is more general than the traditional association rule both
semantically and formally, which makes the traditional association rule a
special case of MDITAR. However, there are some limitations in the MDITAR
definition given by H.J. Lu.

* This work was supported by the National Natural Science Foundation of China and 
the National Doctoral Subject Foundation of China. 



N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 104-^^^ 1999. 
(c) Springer-Verlag Berlin Heidelberg 1999






In this paper, we first analyze the inadequacies of Lu’s definition, and then
introduce a modified and extended definition of MDITAR that is more
general, practicable and reasonable. The remainder of this paper is organized
as follows. Section 2 presents an analysis of Lu’s MDITAR definition. Section 3
introduces our new definition of MDITAR. Section 4 gives concluding
remarks.



2 Inadequacies of Lu’s definition of MDITAR



There are 3 major limitations in the definition of MDITAR given by Lu. 



2.1 Only quantitative attributes are considered 

According to Lu’s definition of MDITAR, for an arbitrary transaction record
$T_i = (d_1, d_2, \ldots, d_n, E_i)$, $d_1, d_2, \ldots, d_n$ are treated as equal-interval values in
the $N$-dimensional attribute space. In such a case, the relationship between transaction
records can be represented by relative differences in dimensional attribute
values. However, this is not always the case in real life. For
example, if we want to mine commodity wholesale market data to predict
price trends, then 2 dimensional attributes will be used: one is the trading time,
the other is the location of the wholesale market, say the city. So there is
a dimensional attribute pair (city, time) for each transaction record, where the
city attribute is categorical and can neither be assigned concrete quantitative values
nor take part in arithmetic operations. Therefore, when defining MDITAR, different
types of dimensional attributes should be considered to make it more general
and applicable.



2.2 Association rule is formulated too strictly 

In [5], an MDITAR $X \Rightarrow Y$ ($X$, $Y$ and $X \cup Y$ are frequent item-sets) is strictly
formulated by the relative address E-ADDR($X \cup Y$). Such a rigid relative address
constraint may well cause the mining to miss MDITARs
that would be found under a less strict constraint. For example, suppose the
mining goals are $a(0) \Rightarrow b(1)$, $a(0) \Rightarrow b(2)$, and $a(0) \Rightarrow b(4)$. For a predefined
support threshold, it is quite possible that none of the event
sets $\{a(0), b(1)\}$, $\{a(0), b(2)\}$ and $\{a(0), b(4)\}$ is frequent. However, if we loosen the formal
constraint a little, for example by setting the mining goal to $a(0) \Rightarrow b(n)$ ($n \le 4$),
then it is much more likely that we will discover such rules. Furthermore, if the
mining goal is set to $a(n_1) \Rightarrow b(n_2)$ ($|n_1 - n_2| \le 4$), the chance of successfully
mining this rule is greater than in the previous cases. On the other hand, the
relative address E-ADDR is not suitable for dealing with categorical attributes.
Consequently, in order to enhance the practicability of MDITARs, a more flexible
formal constraint on MDITARs must be introduced.

2.3 The support and confidence definitions are not reasonable enough

Based on the support and confidence definitions of MDITAR in [5], the following 
results can be inferred. 

1. Support < confidence; 

2. The implication of support for different frequent item sets or event sets is 
not consistent. 

Here is an example. Suppose there is a transaction
database $TD = \{T_1(0, a), T_2(3, b), T_3(3, c)\}$. Based on Lu’s definition, the
support of $\{a(0), c(3)\}$ is 100%, and the support of $\{a(0)\}$ is 33%. This is a very
strange result, for $\{a(0), c(3)\}$ and $\{a(0)\}$ both occur only once in the database,
and the latter is a subset of the former, yet their supports are quite different. The
cause lies in the definition of support. With such a support definition, the
foundation of the Apriori algorithm can no longer be guaranteed. Nevertheless, [5]
still applies an extended Apriori algorithm to mine one-dimensional stock data.
Therefore, new definitions of support and confidence for MDITAR are
necessary in order to eliminate the unreasonable aspects of the existing definition of
MDITAR.

3 An improved definition of MDITAR 

Based on the above analysis of the inadequacies of the MDITAR definition, we
propose here a new definition of MDITAR to strengthen its generality,
practicability and reasonableness. The modification and extension focus on the following
3 aspects:

1. Introduce categorical attributes into the MDITAR definition;

2. Adopt an association constraint mode to formulate MDITAR;

3. Define new support and confidence formulas.

Definition 1. Let $E = \{e_1, e_2, \ldots, e_n\}$ be a set of literals or events,
$\{C_1, C_2, \ldots, C_k, D_1, D_2, \ldots, D_l\}$ be a set of attributes, and $l + k = N$. A transaction
database consists of a series of transaction records $(c_1, c_2, \ldots, c_k, d_1, d_2, \ldots, d_l, E_i)$, where
$\forall i\,(1 \le i \le k)\,(c_i \in DOM(C_i))$, $\forall j\,(1 \le j \le l)\,(d_j \in DOM(D_j))$, $DOM(C_i)$
and $DOM(D_j)$ are the domains of $C_i$ and $D_j$ respectively, and $E_i \subseteq E$. $C_i$ ($1 \le i \le k$)
is a categorical attribute and $D_j$ ($1 \le j \le l$) is a quantitative attribute
with an infinite domain. A transaction database with $N$ attributes is called an
$N$-dimensional transaction database, or a multidimensional transaction database (for
$N \ge 2$).

Here we introduce two types of attributes: a C type attribute is categorical, and
a D type attribute is quantitative with an infinite domain. Just as in [5], we take D
type attribute values to be equally spaced in their domain. In fact, when all attributes in
the database are categorical, an MDITAR can also be treated as a traditional association
rule.

Definition 2. For any transaction record $T_i$ or event $e_i$ in a transaction record,
there is a corresponding set of attribute values $(c_{1i}, c_{2i}, \ldots, c_{ki}, d_{1i}, d_{2i}, \ldots, d_{li})$,
which is called the status of the transaction record $T_i$ or event $e_i$, and abbreviated
as $s$.

Definition 3. For any transaction set $\{T_1, T_2, \ldots, T_k\}$ or event set $\{e_1, e_2, \ldots, e_k\}$,
there is a set of statuses $\{s_1, s_2, \ldots, s_k\}$ in which a certain relationship must
exist among all elements of the status set. This kind of relationship can be seen
as a constraint on the transaction set or event set. We define such a relationship
as the status constraint mode of the transaction set $\{T_1, T_2, \ldots, T_k\}$ or event set
$\{e_1, e_2, \ldots, e_k\}$, and denote it by SCM.

Suppose there is a 2-dimensional transaction database with attributes $A_1$ and $A_2$,
which are C type and D type attributes respectively, and an event set
$e = \{a(“A”, 2), b(“B”, 4), c(“C”, 5)\}$. Consider the following 3 SCMs.

1. $a.A_1 = “A”$, $b.A_1 = “B”$, $c.A_1 = “C”$, $b.A_2 - a.A_2 = 2$, $c.A_2 - b.A_2 = 1$;

2. $a.A_1 = “A”$, $b.A_1 = “B”$, $c.A_1 = “C”$, $b.A_2 - a.A_2 \le 3$, $c.A_2 - a.A_2 \le 3$;

3. $\max(a.A_2, b.A_2, c.A_2) - \min(a.A_2, b.A_2, c.A_2) \le 5$.

Obviously, the event set $e$ conforms with all 3 SCMs. However, the constraint
strength of the 3 SCMs is different. The advantage of SCM over E-ADDR is its
flexibility and capability of coping with categorical attributes.
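
The three SCMs of this example can be written as predicates over an event set, which makes the difference in constraint strength visible. This is only a sketch: the dict encoding of events, the function names, and the use of non-strict bounds (at most 3, at most 5) are assumptions; the second event set `e_loose` is hypothetical and not taken from the paper.

```python
# An event set maps an event name to its status (A1 category, A2 value).
e = {"a": ("A", 2), "b": ("B", 4), "c": ("C", 5)}


def categories_match(ev):
    """a.A1 = 'A', b.A1 = 'B', c.A1 = 'C'."""
    return ev["a"][0] == "A" and ev["b"][0] == "B" and ev["c"][0] == "C"


def scm1(ev):
    """Exact offsets, comparable to a fixed relative address."""
    return (categories_match(ev)
            and ev["b"][1] - ev["a"][1] == 2
            and ev["c"][1] - ev["b"][1] == 1)


def scm2(ev):
    """Bounded offsets relative to event a."""
    return (categories_match(ev)
            and ev["b"][1] - ev["a"][1] <= 3
            and ev["c"][1] - ev["a"][1] <= 3)


def scm3(ev):
    """Only the overall span of the quantitative attribute is bounded."""
    values = [status[1] for status in ev.values()]
    return max(values) - min(values) <= 5


print([scm(e) for scm in (scm1, scm2, scm3)])        # [True, True, True]

# A hypothetical event set with a different offset for b.
e_loose = {"a": ("A", 2), "b": ("B", 3), "c": ("C", 5)}
print([scm(e_loose) for scm in (scm1, scm2, scm3)])  # [False, True, True]
```

The second event set fails only the strictest mode, which is the sense in which a looser SCM admits more occurrences of the same events.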

Definition 4. Given an event set $e = \{e_1, e_2, \ldots, e_k\}$ and its SCM, and a
transaction set $T = \{T_1, T_2, \ldots, T_l\}$. If there is a minimum subset $T_{min}$ of $T$,
$T_{min} = \{T_{i_1}, T_{i_2}, \ldots, T_{i_j}\}$ ($1 \le i_j \le l$), that satisfies

1. for every event $e_i$ in $e$, there is a corresponding transaction $T_j$ in $T_{min}$ such
that $e_i \in T_j.E_j$;

2. $T_{min}$ conforms with the SCM in the same way as $e$ does,

then we say $T$ contains $e$ in terms of the SCM, and $T_{min}$ is one of $T$’s minimum
subsets containing $e$.

Definition 5. If a set of events $e = \{e_1, e_2, \ldots, e_k\}$ is associative at a certain
frequency in the database, then we call $e$’s SCM an association-constraint mode,
and denote it by ACM.

Definition 6. Suppose there is an $N$-dimensional transaction database $D_N$ from
which MDITARs are mined. The minimum support and confidence are predefined
as $sup_{min}$ and $cof_{min}$ respectively. Then an $N$-dimensional inter-transaction
association rule $X \Rightarrow Y$ can be represented by a quintuple $(X, Y, ACM, sup, cof)$,
where

1. $X$, $Y$ and $X \cup Y$ are frequent event-sets conforming with the association
constraint mode ACM, and $X \cap Y = \emptyset$;

2. $sup$ and $cof$ are the support and confidence of $X \cup Y$, and

$sup = |X \cup Y| / |D_N| \ge sup_{min}$ (1)

$cof = sup(X \cup Y) / sup(X) \ge cof_{min}$ (2)

Here $|D_N|$ is the cardinality of the $N$-dimensional transaction database $D_N$,
and $|X \cup Y|$ is the total number of minimum subsets containing $X \cup Y$ in $D_N$.
Clearly, the above definitions of support and confidence are similar to those of the
traditional association rule, which avoids the problems that come with Lu’s
definitions. The introduction of the association constraint mode also makes MDITAR
mining more flexible and practicable. The miner can freely choose a proper ACM
according to his or her interest.
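
As a rough illustration of formulas (1) and (2), the sketch below counts minimum subsets in a one-dimensional database under an ACM of the max-min form (the span of the quantitative attribute is bounded). The encoding of records, the function names, and the choice of a span bound of 3 are assumptions made for this sketch; the database is the three-record example from Section 2.3.

```python
from itertools import product

# record id -> (value of the quantitative attribute, set of events)
DB = {"T1": (0, {"a"}), "T2": (3, {"b"}), "T3": (3, {"c"})}


def minimum_subsets(db, events, max_span):
    """Minimal sets of records that cover every event within max_span."""
    events = sorted(events)
    # For each event, the records in which it occurs.
    choices = [[rid for rid, (_, evs) in db.items() if ev in evs] for ev in events]
    candidates = set()
    for assignment in product(*choices):
        rids = frozenset(assignment)
        values = [db[rid][0] for rid in rids]
        if max(values) - min(values) <= max_span:
            candidates.add(rids)
    # Keep only candidates that have no proper subset among the candidates.
    return {s for s in candidates if not any(other < s for other in candidates)}


def support(db, events, max_span):
    return len(minimum_subsets(db, events, max_span)) / len(db)


def confidence(db, x, y, max_span):
    return support(db, x | y, max_span) / support(db, x, max_span)


sup_ac = support(DB, {"a", "c"}, 3)
sup_a = support(DB, {"a"}, 3)
print(sup_ac == sup_a, confidence(DB, {"a"}, {"c"}, 3))  # True 1.0
```

Under this reading, $\{a, c\}$ and $\{a\}$ each have one minimum subset, so both supports are 1/3 and the anomaly described in Section 2.3 does not arise.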

Based on Definition 6, a lemma about MDITAR can be obtained, which forms
the basis of mining MDITARs with the Apriori algorithm.

Lemma 7. If an $N$-dimensional event-set $e = \{e_1, e_2, \ldots, e_k\}$ which conforms with
ACM is a frequent event-set in the $N$-dimensional transaction database $D_N$, then
any subset of $e$ conforming with the same ACM is also a frequent event-set in $D_N$.

4 Conclusions 

MDITAR is a generalization and extension of the traditional association rule.
It is also a new challenge to the data mining community. This paper
gives a modified and extended definition of MDITAR on the basis of the original
definition introduced by Lu [5]. The improved definition makes MDITAR more
general, practicable and reasonable than the old one. Future work will focus on
developing efficient and effective mining algorithms for MDITARs.

Abstract. The concept lattice is an efficient tool for data analysis. Mining
association rules is an important subfield of data mining. In this paper
we investigate the use of the concept lattice for association rules and
present an efficient algorithm which generates the large itemsets
incrementally on the basis of Godin’s work.



1 Introduction 

Concept hierarchies have been shown to have many advantages in the field of
knowledge discovery from large databases. They are convenient for modelling dependence
and causality, and provide a vivid and concise account of the relations between
the variables in the universe of discourse. A concept hierarchy is not necessarily a
tree structure. Wille et al. propose to build the corresponding concept lattice from
binary relations [1, 2, 3]. This kind of special lattice and the corresponding Hasse
diagram represent a concept hierarchy. A concept lattice reflects entity-attribute
relationships between objects. In some practical applications such as information
management and knowledge discovery, concept lattices have produced good results
[1, 4]. R. Missaoui et al. present algorithms for extracting rules from a concept
lattice [5]. R. Godin et al. propose a method to build a concept lattice and the
corresponding Hasse diagram incrementally and compare it with other concept
lattice construction methods [1].

Discovering association rules among items in large databases is recognized as
an important database mining problem. The problem was introduced in [7] for
sales transaction databases. The problem can be formally described as follows
[6].

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $D$ be a set of
transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. The
quantities of items bought are not considered. Each transaction is assigned an
identifier, called TID. Let $X$ be a set of items. A transaction $T$ is said to contain $X$
if and only if $X \subseteq T$. An association rule is an expression of the form $X \Rightarrow Y$, where
$X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. The rule holds with confidence $c$ if $c\%$ of transactions
in $D$ that contain $X$ also contain $Y$. The rule has support $s$ in $D$ if $s\%$ of $D$ contain
$X \cup Y$.
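
These two measures can be computed directly from their definitions. The following is a minimal sketch; the function names are illustrative, and the four transactions are the ones used in the example later in this paper.

```python
def support(transactions, itemset):
    """Percentage of transactions that contain every item of itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return 100.0 * hits / len(transactions)


def confidence(transactions, x, y):
    """Percentage of the transactions containing X that also contain Y."""
    containing_x = [t for t in transactions if x <= t]
    hits = sum(1 for t in containing_x if y <= t)
    return 100.0 * hits / len(containing_x)


D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]

# Rule {b} => {e}: support is measured on X union Y.
print(support(D, {"b", "e"}), confidence(D, {"b"}, {"e"}))  # 75.0 100.0
# Rule {c} => {a}
print(support(D, {"a", "c"}), round(confidence(D, {"c"}, {"a"}), 2))  # 50.0 66.67
```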

The problem of generating association rules is usually decomposed into two
subproblems. We first generate all itemsets whose support exceeds the
minimum support s; then, for each such itemset, we generate the association rules whose
confidence is not less than the minimum confidence c. Note that the overall
performance of mining association rules is determined by the first subproblem.

In this paper we address the first subproblem using a concept lattice, based on
Godin’s work [1]. We argue that our method has several advantages. Firstly, our
method takes only one pass over the database; secondly, it is incremental; finally,
it is efficient.



2 Basic Notions of Concept Lattice 

In this section we briefly recall the necessary basic notions of concept lattices; a
detailed description can be found in [1, 2, 3].

Given a context $(O, D, R)$ describing a set $O$ of objects, a set $D$ of
descriptors and a binary relation $R$, there is a unique corresponding lattice
structure, known as the concept lattice. Each node in the lattice $L$ is a couple,
denoted $(X', X)$, where $X' \in P(O)$ is called the extension of the concept and $X \in P(D)$ is called
the intension of the concept. Each couple must be complete with respect to $R$. A couple
$(X', X) \in P(O) \times P(D)$ is complete with respect to the relation $R$ if and only if the
following properties are satisfied:

(1) $X' = \{x' \in O \mid \forall x \in X,\ x R x'\}$;

(2) $X = \{x \in D \mid \forall x' \in X',\ x R x'\}$.

Note that only maximally extended couples can appear in the concept lattice.

A partial order relation can be built on all concept lattice nodes. Given
$H_1 = (X', X)$ and $H_2 = (Y', Y)$, $H_1 < H_2 \Leftrightarrow X \subset Y$, and the precedence order
means that $H_1$ is a parent or direct generalization of $H_2$ in the lattice. In fact, there is a
dual relationship between $X'$ and $X$ in the lattice, i.e., $X \subset Y \Leftrightarrow Y' \subset X'$, and
therefore $H_1 < H_2 \Leftrightarrow Y' \subset X'$. So, a concept lattice is in essence two lattices connected
together. The Hasse diagram of the lattice can be generated from the
partial order relation: if $H_1 < H_2$ and there is no other element $H_3$ such that
$H_1 < H_3 < H_2$, there is an edge from $H_1$ to $H_2$. The diagram reveals the
generalization/specialization relationship between the concepts and can be
used as an efficient tool for data analysis and knowledge acquisition.

We observe that the relationship between item sets can be well
modelled by a concept lattice. We regard $O$ as a set of TIDs, $D$ as a set of items, and $R$ as
relating every element of $O$ to the items of $D$ that it contains. In fact, in our case the content of $O$
doesn’t matter. What matters is the cardinality of $O$. For efficiency’s sake we use the
cardinality of $O$ instead. Thus, a node in the lattice is denoted by $(C, X)$, where $C$ is the
cardinality of the transaction set and $X$ is their common item set.

Proposition 1 For every node $(C, X)$, if $C$ is bigger than the threshold $t$, then $X$ is
one of the large item sets we look for.

Because every node of the concept lattice is maximally extended, the above
proposition is obvious.

Now we address the problem of generating the lattice, that is, generating the
large item sets incrementally.



3 Using Concept Lattice to Generate Large Item Sets 

Some algorithms have been proposed for generating the concept lattice from a
binary relation. R. Godin et al. proposed an incremental concept formation
algorithm based on the Galois lattice [1]. Below, we present an improved version
and apply it to incremental large item set generation.

Suppose we are given a concept lattice L and a new transaction T to be inserted into
the lattice. The new lattice after the insertion of T is L'. There are
three kinds of node in the new lattice L'. One kind remains intact, one kind is
modified, and one kind is new, generated by the interaction of the
inserted node and nodes in L. The questions then are: which nodes should be modified,
and when should a new node be generated?

Proposition 2 If node $H = (N, T)$ is to be inserted into lattice $L$, then for each node
$N = (C, X)$ in $L$, if $X \subseteq T$, $N$ is updated to $N = (C+1, X)$.

This is easy to understand if we consider that a couple of the lattice must be
maximally extended. Since T includes X’, the number of transactions including X’
must be increased by 1.

Proposition 3 If $(C, X) = \inf\{(D, Y) \in L \mid X = Y \cap T\}$ and there is no node
$(E, X) \in L$, then a new pair $(C = D+1, X = Y \cap T)$ should be generated. $(D, Y)$ is the child
of the new node and is called the generator of the new node.

The proposition says that when the intersection of a node’s intension with the new
node’s intension is a set which does not exist in the lattice, and the node is the minimal
such node, a new node should be generated.

In addition, the edges in the new lattice L' must be updated. The generator of a new
node is always a child of the new node, and the generator’s original parent must
be modified. A new node may have another child, which should be another new
node. The children and parents of intact nodes do not change. The key point
is the search for the parent nodes of a newly generated node, which helps modify the
edges. The proposition below embodies the basic idea of the improvement we propose
to the algorithm in [1].

Proposition 4 The parent of a new node $(C, X)$ is a new node or modified node
$(D, Y)$ such that $(D, Y) = \sup\{(D, Y) \in L' \mid Y \subset X\}$.

We know that the parent of a new node $(C, X)$ must satisfy $Y \subset X$ according to the
properties of the concept lattice. It should also be the minimal one.

Now we introduce the algorithm. The lattice is initialized with the element (0,
Sup(L)). Sup(L) includes all the items in the transaction database while Inf(L)
includes all transactions. Sup(L) and Inf(L) stand for the least upper bound and
greatest lower bound of the lattice respectively. Note that Inf(L) is not explicitly
given in the algorithm, as it is only of theoretical interest.

Incremental concept and large item set generation algorithm 

Input: Lattice L and item set 𝒳, transaction to be added X, threshold t
Output: Updated lattice L' and rule set 𝒳
BEGIN
  Mark ← ∅ /* Initialize set Mark */
  FOR each element H in lattice in ascending ||Y(H)|| order DO
    /* Y(H) stands for intension of H */
    IF Y(H) ⊆ X THEN /* modified node */
      Add C(H) by 1
      IF C(H) > t THEN
        IF there is no P in 𝒳 such that Y(P) = Y(H) THEN
          Add H to 𝒳
        ELSE
          Add C(P) by 1
        ENDIF
      ENDIF
      Add H to Mark
      IF Y(H) = X THEN exit FOR ENDIF
    ELSE
      int ← Y(H) ∩ X
      IF there is no H₁ ∈ Mark such that Y(H₁) = int THEN /* H is generator */
        Create new node N = (C(H)+1, int) and add it to Mark
        IF C(H)+1 > t THEN add N to 𝒳 ENDIF
        Add edge H ← N
        FOR each node M of ancestor of H DO
          /* Search the parent node for new node and modify edges */
          IF M ∈ Mark and Y(M) ⊂ int THEN /* M is parent node of N */
            Add edge N ← M
            IF M is parent of H THEN delete edge H ← M ENDIF
            exit FOR
          ENDIF
        ENDFOR
        IF int = X THEN exit FOR ENDIF
      ENDIF
    ENDIF
  ENDFOR
END




[Figure: Hasse diagram of the example]

The complexity of the algorithm depends on how many new nodes are
generated when a new transaction is inserted. Suppose that transactions consist of
$k$ items on average; then at most $2^k$ new nodes can be generated when a
new transaction is inserted. The complexity of the algorithm is at most
$O(2^k \cdot \|U\|)$, where $\|U\|$ is the cardinality of the database and $k$ is a constant. We observe
that in practice far fewer than $2^k$ new nodes are generated most of the time.

We use the example from [8] to illustrate our algorithm. Suppose $T_1 = \{a, c, d\}$,
$T_2 = \{b, c, e\}$, $T_3 = \{a, b, c, e\}$, $T_4 = \{b, e\}$. The Hasse diagram of the lattice after the four
nodes are inserted is shown above. If the threshold is 2, we get the large 3-itemset $(2, \{b, c, e\})$,
the 2-itemsets $\{(2, \{a, c\}), (3, \{b, e\})\}$, and the 1-itemset $(3, \{c\})$.
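
The example can be reproduced with a simplified sketch of the incremental update. It keeps only the (count, item set) nodes and omits the Hasse diagram edges and the Mark set, so it is not the algorithm above but a reduced variant; the dict representation, the function names, and the reading of the threshold as "count of at least t" (which is what the listed result implies) are assumptions.

```python
def insert_transaction(lattice, transaction):
    """Update a dict {item set: count} with one new transaction.

    Nodes whose item set is contained in the transaction are incremented
    (Proposition 2); intersections that are not yet nodes become new nodes
    with the count of their generator plus one (Proposition 3).
    """
    t = frozenset(transaction)
    old = dict(lattice)  # snapshot of the lattice before insertion
    for intent, count in old.items():
        inter = intent & t
        if inter not in old:
            # The generator is the old node with the largest count that
            # yields this intersection, so keep the maximum seen so far.
            lattice[inter] = max(lattice.get(inter, 0), count + 1)
    for intent, count in old.items():
        if intent <= t:
            lattice[intent] = count + 1


def large_itemsets(lattice, threshold):
    """Non-empty item sets whose count reaches the threshold."""
    return {intent: count for intent, count in lattice.items()
            if intent and count >= threshold}


# Initialize with (0, Sup(L)): the set of all items with count 0.
lattice = {frozenset("abcde"): 0}
for t in ({"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}):
    insert_transaction(lattice, t)

large = large_itemsets(lattice, 2)
for intent, count in sorted(large.items(), key=lambda kv: sorted(kv[0])):
    print(count, sorted(intent))
```

Each transaction is read once, and the four large item sets listed in the example are recovered with the same counts.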



4 Conclusion 

Discovering association rules among items in large databases is an important
database mining problem, and the concept lattice is a convenient tool for data
analysis and knowledge discovery. We use the concept lattice approach to model the
problem of generating large itemsets and present an incremental algorithm to
build the lattice and generate large itemsets. Our method has several advantages.
Firstly, it takes only one pass over the database; secondly, it is
incremental; finally, it is efficient in most cases.

Abstract. In most data mining applications where induction is used
as the primary tool for knowledge extraction, it is difficult to precisely 
identify a complete set of relevant attributes. The real world database 
from which knowledge is to be extracted usually contains a combination 
of relevant, noisy and irrelevant attributes. Therefore, pre-processing the 
database to select relevant attributes becomes a very important task in 
knowledge discovery and data mining. This paper starts with two existing 
induction systems, C4.5 and HCV, and uses one of them to select relevant 
attributes for the other. Experimental results on 12 standard data sets 
show that using HCV induction for C4.5 attribute selection is generally 
useful. 



1 Introduction 

The pervasive use of sensor and information technology has resulted in the gen- 
eration of vast amounts of digital data. Making use of this resource is of vital 
importance to the business, medical, manufacturing and many other sectors. 
Converting raw sensor data into useful information for human decision makers 
is one of the driving forces behind research into applications of data mining. 

Inductive learning is an area of research in data mining which describes the 
process of hypothesising a concept from a set of examples. It is the primary 
method for creating or discovering new knowledge from a data set by providing 
concept descriptions which are able to generalize. In order to generalize, a learn- 
ing system needs to contain bias, assumptions without which learning would not 
be possible [1]. These biases can be incorporated in the structure of the model, 
the hypothesis representation language or in the workings of the model during 
hypothesis generation. 

One major form of representation used in inductive learning is the decision
tree (DT) in a process known as the top-down induction of decision trees.
Example algorithms include C4.5 [11], its predecessor ID3 [10], and ITI [13]. The other
major representation form is the “if... then” rule structure. Examples include the
AQ series [7], CN2 [2], and HCV [15].

DT structures as induced by ID3-like algorithms are known to cause
fragmentation of the data set whenever a high-arity attribute is tested at a node [9]. This
diminishes the understandability of the induced concept hypotheses.
Furthermore, ID3-like DT structures have a tendency to repeat subtrees when expressing
disjunctive concepts of the form (attribute 1=A) or (attribute 2=B). This is
referred to as the replication problem. As a consequence of these two problems
DTs tend to grow very large in most realistic problem domains.

Rule-like structures are a more convenient form to express knowledge. Rules 
are similar to the way human experts express their expertise and human users 
are comfortable with this way of expressing newly extracted knowledge [4] . This 
is an important consideration during expert validation of a large knowledge base 
(KB) which might well be equivalent to tens of decision trees, during debugging 
of the KB and in engendering user acceptance of a KB system. Therefore with 
DT algorithms there is often the added overhead of having to decompile the DT 
into a set of rules. Algorithms directly inducing a set of rules are therefore at a 
distinct advantage as they circumvent the creation of a DT and go directly to 
the creation of the rule set. It is also often the case that a direct rule inducer 
produces a more compact concept description than a DT inducer. 

Hypothesis generation in induction involves searching through a vast (possi- 
bly infinite) space of concept descriptions. Practical systems constrain the search 
space through the use of bias [14]. Bias forces the search to prefer certain
hypothesis spaces over others. One such bias, which has not been given much
attention, is to minimize the number of features in the concept description. This
paper starts with C4.5 [11], which is commonly recognized as a state-of-the-art 
method for inducing decision trees [3], and HCV [16], which performs rule induc- 
tion without generating decision trees. Experiments are carried out using one of 
these two systems to select relevant attributes for the other. In the next section 
we outline C4.5 and HCV and provide their baseline results on the 12 databases 
for our experiments. In the third section we present the results of our empirical 
investigation of using induction for attribute selection. 



2 C4.5, HCV and Their Baseline Results 

2.1 C4.5: A Decision Tree Inducer & Rule Decompiler

The heart of the popular and robust C4.5 program is a decision tree inducer.
It performs a depth-first, general to specific search for hypotheses by recursively
partitioning the data set at each node of the tree.

C4.5 attempts to build a simple tree by using a measure of the information 
gain ratio of each feature and branching on the attribute which returns the max- 
imum information gain ratio. At any point during the search a chosen attribute 
is considered to have the highest discriminating ability between the different 
concepts whose description is being generated. This bias constrains the search 
space by generating partial hypotheses using a subset of the dimensionality of 
the problem space. This is a depth-first search in which no alternative strategies 
are maintained and in which no back-tracking is allowed. The final DT built 
therefore, though simple is not guaranteed to be the simplest possible tree. 
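The greedy attribute choice described above can be illustrated with a small sketch. It assumes nominal attributes and examples stored as dictionaries, and it omits the refinements of the real C4.5 program (continuous attributes, missing values, pruning); the function names are illustrative and are not taken from C4.5.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain ratio of one nominal attribute.

    values[i] is the attribute value of example i, labels[i] its class.
    """
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    # Expected class entropy after splitting on the attribute.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the partition sizes
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels, attributes):
    """Greedy choice: the attribute with the largest gain ratio."""
    return max(attributes,
               key=lambda a: gain_ratio([row[a] for row in rows], labels))
```

Because only the single best attribute is kept at each node and nothing is revisited, the sketch shares the property noted above: the resulting tree is simple but not guaranteed to be the simplest.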

C4.5 uses a pruning mechanism wherein the tree construction process is
stopped if an attribute is deemed to be irrelevant and should not be branched
upon. A test for statistical dependency between the attribute and the class
label is carried out to test for this irrelevancy. The induced DT can be converted
to a set of rules, by its rule decompiler C4.5rules, with some pruning and the
generation of a default rule.

In the case of missing attribute values the unknown values are assumed to be 
distributed in proportion to the relative frequency of these values in the training 
set. Replacement of unknown values is therefore carried out according to this 
assumption. When classifying test data with missing attribute values C4.5 tests 
all branches of this attribute and works out the probability that each value is 
the correct choice. This probability is summed over all classes. This method 
performs well under conditions of increasing incidence of unknown values on the 
data sets used by Quinlan [6]. 

2.2 HCV (Version 2.0): A Rule Induction Program

HCV avoids constructing a DT by using an extension matrix approach [9] to 
generate a set of conjunctive rules to cover each class in the training set relative 
to all other instances. The extension matrix approach was originally proposed 
by Hong [5] and then extended in HCV by Wu [15,16]. In a revised version HCV 
(Version 2.0) has been extended to be able to deal with both noisy data and 
continuous attributes [17]. HCV (Version 2.0) is the version used in this study. 

The HCV algorithm considers one class of instances in turn (termed the pos- 
itive examples, PE) against all the other classes (termed the negative examples, 
NE). The matrix of NE is termed the negative example matrix (NEM). An ex- 
tension matrix (EM) is constructed by taking a positive example and comparing 
it against each member in the NEM. Each attribute value in the NEM which is 
equal to the corresponding attribute value of the positive example is replaced 
by a flag denoting a dead element which is unable to distinguish between the 
positive example and the NE. Repeating this exercise for all positive examples re- 
sults in a set of extension matrices. Superimposing a group of extension matrices 
results in the formation of a disjunction matrix (EMD). This superpositioning 
is governed by two rules: a dead element flag results where at least one EM has 
a dead element and the original NEM attribute value is retained where no EM 
has a dead element. 
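The construction of extension matrices and their superposition can be sketched as follows. This is only an illustration of the two rules stated above, assuming examples are lists of nominal values and using `None` as the dead-element flag; it is not the HCV implementation.

```python
DEAD = None  # assumed flag for a dead element; real values must not be None


def extension_matrix(positive, nem):
    """Compare one positive example against every row of the NEM.

    A NEM value equal to the positive example's value in the same
    column cannot separate the two examples, so it becomes DEAD.
    """
    return [[DEAD if neg_value == pos_value else neg_value
             for neg_value, pos_value in zip(neg_row, positive)]
            for neg_row in nem]


def disjunction_matrix(ems, nem):
    """Superimpose extension matrices built from the same NEM.

    An element is DEAD if it is dead in at least one EM; otherwise
    the original NEM value is retained.
    """
    return [[DEAD if any(em[i][j] is DEAD for em in ems) else nem[i][j]
             for j in range(len(nem[i]))]
            for i in range(len(nem))]
```

For example, with `nem = [["a", "x"], ["b", "y"]]`, the positive example `["a", "y"]` kills the first element of the first row and the second element of the second row.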

A path is defined as the set of attribute values in the NEM such that there 
is only one value per row and none of the attribute values are dead elements. 
An intersecting group of positive examples occurs when a disjunction matrix 
contains at least one valid path. Such a path corresponds to a conjunctive cover 
for all the positive examples in the EMD but none of the negative ones. Since the 
tasks of finding optimal partitions and extracting optimal conjunctive rules from 
the disjunction matrices are both NP-hard, HCV provides a set of heuristics to 
carry out these tasks. 
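A path and the conjunctive cover it induces can be sketched in a few lines. The sketch assumes a disjunction matrix stored as a list of rows with `None` marking dead elements, and it simply takes the first live element in each row; HCV's own heuristics for choosing good paths are not reproduced here.

```python
DEAD = None  # assumed flag for a dead element


def find_path(emd):
    """Pick one live (column, value) pair per row, or return None.

    A valid path exists only if every row has at least one live element.
    """
    path = []
    for row in emd:
        live = [(j, v) for j, v in enumerate(row) if v is not DEAD]
        if not live:
            return None
        path.append(live[0])
    return path


def covers(path, example):
    """Conjunctive rule: the example differs from every chosen NEM value."""
    return all(example[j] != value for j, value in path)
```

Each pair `(j, value)` on the path reads as the selector "attribute j is not value", so the conjunction excludes every negative example while still matching the positive examples from which the matrix was built.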

HCV has been shown to induce compact rules in low order polynomial time. 
The rules are expressed in the form of variable-valued logic. On a battery of 
data sets from both artificial and real world domains HCV has been found to be 
highly accurate relative to results from ID3-like algorithms [17]. 






2.3 Baseline Results 

The Data. The data used in our experiments (see Table 1) throughout this paper
can be divided into two groups. The first group is made up of data with 100%
nominal attributes. The second group contains data of mixed nominal and
continuous attributes. All these data sets were obtained from the University of
California at Irvine machine learning database repository [8]. The majority of
these data sets are obtained from real world domains and are noisy.



Table 1. Data Set Characteristics

Data Set     # of       # of        # of     Majority   Continuous      Avg # of Values  Unknown
             Instances  Attributes  Classes  Class (%)  Attributes (%)  per Attribute    Values (%)
Audiology    226        69          24       21.00        0.00          2.23              2.00
Monk 1       556         6           2       50.00        0.00          2.80              0.00
Monk 3       554         6           2       52.00        0.00          2.80              0.00
Vote         435        16           2       61.40        0.00          3.00              5.00
Aus-Credit   690        15           2       56.00       40.00          4.56              0.65
Hungarian 2  294        13           2       64.00       36.00          3.00             20.46
Imports 85   205        25           7       32.70       60.00          6.00              1.15
Lab Neg       56        16           2       65.00       50.00          2.62             35.75
Swiss 5      123        13           5       39.00       38.00          3.00             17.00
Va 2         200        13           5       28.00       38.00          3.00             26.80
Va 5         200        13           5       28.00       38.00          3.00             26.80
Wine         178        13           3       40.00      100.00          n/a               0.00


These databases have been selected because each of them consists of two stan- 
dard components when created or collected by the original providers: a training 
set and a test set. The standard deviation has been given in some by the original
database providers. The databases have been used “as is”. Example ordering has
not been changed, nor have examples been moved between the sets. For each
database, induction was performed on the training set, and the accuracies listed
in the following tables are from the test set.



Test Conditions. Table 2 presents the results generated by C4.5, C4.5rules
and HCV on the databases in Table 1. The second column (#) shows the original
number of attributes in each database. # (HCV), # (C4.5) and # (C4.5rules)
indicate the number of attributes used in the induction results of HCV, C4.5
and C4.5rules respectively. % (HCV) is HCV’s accuracy on the test set of each
database, and Err (%, C4.5) and Err (%, C4.5rules) are the error rates of C4.5
and C4.5rules.

Throughout the experiments the same default conditions were used for all the
data sets. Fine tuning different parameters in either C4.5 or HCV would
probably have achieved higher accuracy rates. This, however, would have been at
the expense of a loss in generality and applicability of the conclusions. Default
conditions were adopted for both programs, C4.5 and HCV (Version 2.0),
as recommended by the respective authors.

Table 2. Baseline Results

Data Set     #   # (HCV)  % (HCV)  # (C4.5)  Err (%, C4.5)  # (C4.5rules)  Err (%, C4.5rules)
Audiology    69  42        73.08   14        15.4           15             19.2
Aus-Credit   15  14        82.50    9        20.0            9             19.5
Hungarian 2  13  11        86.25   10        20.0            7             15.0
Imports 85   25  21        62.71   10        32.2           10             32.2
Lab Neg      16   9        76.47    2        17.6            3             11.8
Monk 1        6   3       100.00    5        24.3            6              0.0
Monk 3        6   4        97.22    2         2.8            5              3.7
Swiss 5      13  10        28.13    6        68.8            5             71.9
Va 2         13   9        78.87    5        29.6            5             28.2
Va 5         13  12        26.76    7        73.2            8             71.8
Vote         16  14        97.04    2         3.0            7              4.4
Wine         13   9        90.38    4        17.3            4             17.3

3 Empirical Investigation 

3.1 Experiment Design 

This section describes experiments that use one of C4.5 and HCV to select
attributes for the other. The following is the experiment design for Section 3.2.

1. For each database in Table 1, run HCV on the training set of the database.

2. Extract the attributes ($\mathcal{A}$) HCV has used in its induction results.

3. For each attribute $A$ in the original database, if $A \notin \mathcal{A}$, then

– Comment it out from the dictionary file.

– Take its values out from the training and test sets.

4. Run C4.5 and C4.5rules on the changed training and test sets and record
the results.

5. Go to Step 1 for the next database.

For Section 3.3, C4.5 and C4.5rules are used in Step 2 and HCV in Step 4. As
in Section 2, default conditions were adopted for both HCV and C4.5, and the
databases were used “as is” except for the changes mentioned in the above design.
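Steps 2 and 3 of this design amount to projecting the data onto the attributes that the first inducer actually used. A minimal sketch follows, assuming a hypothetical in-memory representation in which each rule is a list of (attribute, value) tests and each example is a dictionary; the actual experiments edited the C4.5 and HCV data files instead.

```python
def used_attributes(rules):
    """Collect every attribute mentioned in a set of induced rules.

    Each rule is assumed to be a list of (attribute, value) tests.
    """
    return {attribute for rule in rules for attribute, _ in rule}


def project(rows, keep):
    """Drop every attribute that is not in the selected set."""
    return [{a: v for a, v in row.items() if a in keep} for row in rows]


# Hypothetical rules produced by the selecting system.
rules = [[("outlook", "sunny"), ("windy", "no")], [("outlook", "rain")]]
selected = used_attributes(rules)
train = [{"outlook": "sunny", "windy": "no", "humidity": "high"}]
reduced_train = project(train, selected)  # "humidity" is removed
```

The same projection has to be applied to the test set so that the second inducer sees identical attributes during training and evaluation.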

3.2 HCV Induction for C4.5 Attribute Selection

Table 3 provides a numerical evaluation of the error rates of the decision trees
produced by C4.5 and the rules produced by C4.5rules on the 12 databases,
after using HCV induction for attribute selection. The baseline results (from
Table 2) are given before the symbol → and the results after induction for
attribute selection are given after it.

Out of the 12 databases, C4.5’s error rate decreases significantly on one
database (Monk 1), increases slightly on another database (Va 2), and remains
the same on the remaining 10 databases. Meanwhile, C4.5rules has lower
error rates on 4 databases (Hungarian 2, Monk 3, Va 2, and Vote), and a higher
error rate on only one database (Swiss 5).

Table 3. C4.5/C4.5rules Accuracy after HCV

Data Set     #   # (HCV)  Errors (%, C4.5)  Errors (%, C4.5rules)
Audiology    69  42       15.4 → 15.4       19.2 → 19.2
Aus-Credit   15  14       20.0 → 20.0       19.5 → 19.5
Hungarian 2  13  11       20.0 → 20.0       15.0 → 13.8
Imports 85   25  21       32.2 → 32.2       32.2 → 32.2
Lab Neg      16   9       17.6 → 17.6       11.8 → 11.8
Monk 1        6   3       24.3 → 11.1        0.0 →  0.0
Monk 3        6   4        2.8 →  2.8        3.7 →  0.0
Swiss 5      13  10       68.8 → 68.8       71.9 → 75.0
Va 2         13   9       29.6 → 31.0       28.2 → 25.4
Va 5         13  12       73.2 → 73.2       71.8 → 71.8
Vote         16  14        3.0 →  3.0        4.4 →  3.0
Wine         13   9       17.3 → 17.3       17.3 → 17.3



Table 4. C4.5/C4.5rules Sizes after HCV

Data Set     #   # (HCV)  # of Nodes  # of Rules    # of Tests
                          (C4.5)      (C4.5rules)   (C4.5rules)
Audiology    69  42       52 → 52     23 → 23       65 → 65
Aus-Credit   15  14       44 → 44     14 → 14       42 → 42
Hungarian 2  13  11       37 → 37      7 →  9       14 → 16
Imports 85   25  21       67 → 67     14 → 14       34 → 34
Lab Neg      16   9        7 →  7      5 →  5        7 →  7
Monk 1        6   3       18 → 35     14 → 11       28 → 22
Monk 3        6   4       12 → 12     13 →  9       25 → 14
Swiss 5      13  10       25 → 25      7 →  8       15 → 18
Va 5         13  12       35 → 35      9 →  9       22 → 22
Va 2         13   9       19 → 12      6 →  7       11 → 18
Vote         16  14        7 →  7      7 →  7       10 → 10
Wine         13   9        9 →  9      6 →  6       10 → 10

Table 4 lists the number of nodes in C4.5 trees and the numbers of rules
and tests (or conditions) in C4.5rules for each database. According to this
table, the number of nodes in C4.5 trees increases on one database (Monk 1)
and decreases on another (Va 2). Meanwhile, the complexity of the rules
produced by C4.5rules decreases on 3 databases (Hungarian 2, Monk 1, and
Monk 3) and increases on 2 databases (Swiss 5 and Va 2).

If we compare the error rates and description complexity of C4.5 and C4.5rules
before and after using HCV for attribute selection, the gains in Tables 3 and 4
do not seem to be very significant. However, since the numbers of attributes
after HCV selection are significantly smaller than those in the original
databases, we can still claim that the attribute selection by HCV is useful,
because it reduces the complexity of the original databases and subsequently
the execution times of C4.5 and C4.5rules.



3.3 C4.5 Induction for HCV Attribute Selection

Tables 5 and 6 detail the changes in HCV’s accuracy and rule complexity after 
using C4.5 and C4.5rules for attribute selection. 



Table 5. HCV Accuracy after C4.5 and C4.5rules

Data Set     #   # (HCV)  % (HCV)  # (C4.5)  % (HCV)  # (C4.5rules)  % (HCV)
Audiology    69  42        73.08   14         57.69   15              57.69
Aus-Credit   15  14        82.50    9         83.00    9              83.00
Hungarian 2  13  11        86.25   10         86.25    7              83.75
Imports 85   25  21        62.71   10         55.93   10              55.93
Lab Neg      16   9        76.47    2         64.71    3              64.71
Monk 1        6   3       100.00    5        100.00    6             100.00
Monk 3        6   4        97.22    2         97.22    5              97.22
Swiss 5      13  10        28.13    6         25.00    5              25.00
Va 2         13   9        78.87    5         78.87    5              78.87
Va 5         13  12        26.76    7         32.39    8              28.17
Vote         16  14        97.04    2         61.48    7              97.04
Wine         13   9        90.38    4         73.08    4              73.08

Apart from the Monk 1 database, on which HCV’s results have not changed
after attribute selection by C4.5 and C4.5rules, HCV’s rule complexity in terms
of the numbers of rules and tests decreases on all databases, while its accuracy
also decreases on most of them.

If we look back at Table 2, HCV’s baseline results have generally used more
attributes than C4.5 and C4.5rules. When HCV does not have enough attributes
to choose from during its induction, it is understandable that it has to sacrifice its
accuracy, the most important criterion for many practical knowledge discovery
applications. Therefore, it appears to be a reasonable conclusion that for those
applications where accuracy is among the most important factors, attribute
selection by C4.5 and C4.5rules is harmful for HCV.

Table 6. HCV Sizes after C4.5 and C4.5rules: Rules and Tests

Data Set     #   # (C4.5)  # of rules  # of tests  # (C4.5rules)  # of rules  # of tests
                           (HCV)       (HCV)                      (HCV)       (HCV)
Audiology    69  14        59 → 37     124 →  98   15             → 37        → 101
Aus-Credit   15   9        99 → 25     223 →  91    9             → 25        →  91
Hungarian 2  13  10        22 → 17      64 →  46    7             → 13        →  25
Imports 85   25  10        76 → 32     104 →  62   10             → 32        →  62
Lab Neg      16   2        24 →  4      37 →   4    3             →  4        →   4
Monk 1        6   5         8 →  8      16 →  16    6             →  8        →  16
Monk 3        6   2         8 →  5      16 →   6    5             →  7        →  14
Swiss 5      13   6        32 → 18      85 →  55    5             → 16        →  36
Va 2         13   5        31 → 13      83 →  32    5             → 13        →  32
Va 5         13   7        50 → 30     151 → 101    8             → 33        → 106
Vote         16   2        37 →  3      55 →   2    7             → 12        →  22
Wine         13   4        58 →  6      60 →  11    4             →  6        →  11


4 Conclusion 

Induction always involves searching through a space of concept descriptions.
Appropriate selection of attributes can constrain the search space through the
use of pre-selected attributes. In this paper we have experimented with the idea
of using one induction system to select attributes for another induction system.
The following conclusions appear to be supported by our experimental results.

— Attribute selection by a rule induction system like HCV is useful for decision 
tree construction. Even if the gains in predictive accuracy and description 
complexity are not significant, the selection can reduce the complexity of the 
original databases and subsequently the time for decision tree construction. 

— Decision tree construction using C4.5 tends to use a minimal set of attributes. 
If the number of attributes in a decision tree is less than that required by 
a rule induction system, using a decision tree construction system to select 
attributes is harmful for the rule induction system.

Abstract. Classifier learning is a key technique for KDD. Approaches
to learning classifier committees, including Boosting, Bagging, Sasc, and 
SascB, have demonstrated great success in increasing the prediction ac- 
curacy of decision trees. Boosting and Bagging create different classifiers 
by modifying the distribution of the training set. Sasc adopts a different 
method. It generates committees by stochastic manipulation of the set 
of attributes considered at each node during tree induction, but keeping 
the distribution of the training set unchanged. SascB, a combination of 
Boosting and Sasc, has shown the ability to further increase, on aver- 
age, the prediction accuracy of decision trees. It has been found that the 
performance of SascB and Boosting is more variable than that of Sasc, 
although SascB is more accurate than the others on average. In this 
paper, we present a novel method to reduce variability of SascB and 
Boosting, and further increase their average accuracy. It generates mul- 
tiple committees by incorporating Bagging into SascB. As well as im- 
proving stability and average accuracy, the resulting method is amenable 
to parallel or distributed processing, while Boosting and SascB are not. 
This is an important characteristic for datamining in large datasets. 



1 Introduction 

To increase the prediction accuracy of classifiers, classifier committee¹ learning
techniques have been developed with great success [2,3,4,5,6,7,8,9,10,11]. This
type of technique generates several classifiers to form a committee by using a
single base learning algorithm. At the classification stage, the committee members
vote to make the final decision.

Bagging [5] and Boosting [2,3,6,10,12], as two representative methods of this
type, can significantly decrease the error rate of decision tree learning [3,4,11].
They repeatedly build different classifiers using a base learning algorithm, such
as a decision tree generator, by changing the distribution of the training set.
Bagging generates different classifiers using different bootstrap samples.
Boosting builds different classifiers sequentially. The weights of training examples
used for creating each classifier are modified based on the performance of the
previous classifiers. The objective is to make the generation of the next classifier
concentrate on the training examples that are misclassified by the previous
classifiers. The main difference between Bagging and Boosting is that the latter
adaptively changes the distribution of the training set based on the performance
of previously created classifiers and uses a function of the performance of a
classifier as the weight for voting, while the former stochastically changes the
distribution of the training set and uses equal weight voting. Although Boosting
is generally more accurate than Bagging, the performance of Boosting is more
variable than that of Bagging [4,11].

¹ Committees are also referred to as ensembles [1].

N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 123-132, 1999.
© Springer-Verlag Berlin Heidelberg 1999

As an alternative approach to generating different classifiers to form a com- 
mittee, Sasc (Stochastic Attribute Selection Committees) [13] builds different 
classifiers by modifying the set of attributes considered at each node, while the 
distribution of the training set is kept unchanged. Each attribute set is selected 
stochastically. Experiments show that Sasc, like Boosting, can also significantly 
reduce the error rate of decision tree learning [13]. In addition, Sasc is more 
stable than Boosting [13]. 

Sasc [13] is a minor variant of a class of committee learning algorithm that 
learns a committee by randomizing the base learning process [1,7,8, 9]. While 
Sasc has not been directly compared with these alternatives, comparisons of 
reported results suggest that its performance is comparable to that of others. 

Based on the observation that both Boosting and Sasc can significantly
increase the prediction accuracy of decision trees but through different
mechanisms, we developed another technique to further improve the accuracy of
decision trees [14]. The new approach is called SascB (Stochastic Attribute
Selection Committees with Boosting), a combination of the Boosting and Sasc
techniques. SascB has shown the ability to outperform, on average, either
Sasc or Boosting alone in terms of lower error rate. However, like Boosting, 
SascB is more variable than Sasc, as the Boosting component is the driving 
mechanism in the SascB procedure. Sasc and Bagging are amenable to par- 
allel and distributed processing while SascB and Boosting are not, since the 
generation of each committee member, a classifier, is independent for the former 
while it must occur sequentially for the latter. 

In the light of the findings mentioned above from the previous studies on com- 
mittee learning, in this paper, we present a novel approach, namely SascMB 
(Stochastic Attribute Selection Committees with Multiple Boosting), to im- 
proving the stability and average accuracy of SascB and Boosting. It generates 
multiple subcommittees by incorporating Bagging into SascB using the multi- 
boosting technique [15]. We expect that splitting one committee into multiple 
subcommittees, with each subcommittee being created from a bootstrap sam- 
ple of the training set, can reduce the variability of Boosting and SascB, since 
the Boosting process is broken down into several small processes. In addition, 
we expect that introducing Bagging can further improve the accuracy, since it 
increases the diversity and independence of committee members. At the same 
time, the new algorithm is amenable to parallel and distributed processing.

2 Boosting, Bagging, Sasc, and SascB

Since the SascMB technique is a combination of Bagging and SascB which
is, in turn, a combination of Boosting and Sasc, we briefly discuss Boosting,
Bagging, Sasc, and SascB in this section. Their classification process is
presented in Section 2.5, since it is the same for all of them.

2.1 Boosting 

Boosting [4,10,11,12] is a general framework for improving base learning
algorithms. The key idea of Boosting was presented in Section 1. Here, we describe
our implementation of the Boosting algorithm with decision tree learning, called
Boost. It follows the Boosted C4.5 algorithm (AdaBoost.M1) [4] but uses a new
Boosting equation as shown in Equation 1, derived from [10].

Given a training set D consisting of m instances and an integer T, the number
of trials, Boost builds T pruned trees over T trials by repeatedly invoking
C4.5 [16]. Let $w_t(x)$ denote the weight of instance x in D at trial t. At the first
trial, each instance has weight 1; that is, $w_1(x) = 1$ for each x. At trial t, decision
tree $H_t$ is built using D under the distribution $w_t$. The error $\epsilon_t$ of $H_t$ is then
calculated by summing up the weights of the instances that $H_t$ misclassifies and
dividing by m. If $\epsilon_t$ is greater than 0.5 or equal to 0, $w_t(x)$ is re-initialized using
bootstrap sampling, and then the Boosting process continues. Note that the
tree with $\epsilon_t > 0.5$ is discarded,² while the tree with $\epsilon_t = 0$ is accepted by the
committee. Otherwise, the weight $w_{t+1}(x)$ of each instance x for the next trial is
computed using Equation 1. These weights are then renormalized so that they
sum to m.

$$w_{t+1}(x) = w_t(x) \exp\left((-1)^{d(x)} \alpha_t\right), \qquad (1)$$

where $\alpha_t = \frac{1}{2} \ln\left((1 - \epsilon_t)/\epsilon_t\right)$; $d(x) = 1$ if $H_t$ correctly classifies x and
$d(x) = 0$ otherwise.
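One reweighting step can be sketched as below. The sketch assumes that the coefficient in Equation 1 is $\alpha_t = \frac{1}{2}\ln((1-\epsilon_t)/\epsilon_t)$, the usual form in [10], and it leaves the bootstrap re-initialization to the caller; `boost_step` is an illustrative name, not part of Boost.

```python
import math


def boost_step(weights, correct):
    """One Boosting reweighting step for a tree H_t.

    weights[i] is w_t(x_i); correct[i] is True if H_t classifies x_i
    correctly. Returns (error, new_weights). new_weights is None when
    the error is 0 or above 0.5, in which case the caller is expected
    to re-initialize the weights by bootstrap sampling.
    """
    m = len(weights)
    error = sum(w for w, ok in zip(weights, correct) if not ok) / m
    if error == 0 or error > 0.5:
        return error, None
    alpha = 0.5 * math.log((1 - error) / error)
    # (-1)^d(x) is -1 for correct instances and +1 for misclassified ones.
    raw = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    scale = m / sum(raw)  # renormalize so that the weights sum to m
    return error, [w * scale for w in raw]
```

With four instances of weight 1 and one misclassification, the error is 0.25 and the misclassified instance ends up carrying half of the total weight, which is what makes the next tree concentrate on it.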

2.2 Bagging 

The primary idea of Bagging [5] is to generate a committee of classifiers,
each built from a bootstrap sample of the original training set. Given a committee
size T and a training set D consisting of m instances, Bag, our implementation of
Bagging, generates T - 1 bootstrap samples, each created by uniformly
sampling m instances from D with replacement. It then builds one decision tree
using C4.5 from each bootstrap sample. Another tree is created from the original
training set.
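The sampling scheme of Bag can be sketched as follows. Only the construction of the training sets is shown; building a tree from each set is left to the base learner, and the function name and the `seed` parameter are illustrative assumptions.

```python
import random


def bag_training_sets(data, committee_size, seed=0):
    """Return the training sets used by a committee of the given size.

    The first set is the original training set; each of the remaining
    committee_size - 1 sets is a bootstrap sample of len(data)
    instances drawn uniformly with replacement.
    """
    rng = random.Random(seed)
    m = len(data)
    training_sets = [list(data)]
    for _ in range(committee_size - 1):
        training_sets.append([data[rng.randrange(m)] for _ in range(m)])
    return training_sets
```

Because the samples are independent of each other, the trees can be built in any order or in parallel.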

2.3 Sasc 

The key idea of Sasc [13] is to vary the members of a decision tree committee
by stochastic manipulation of the set of attributes available for selection
at decision nodes. This creates decision trees that each partition the instance
space differently. We use C4.5 [16] with the modifications described below as the
base classifier learning algorithm in Sasc. When building a decision node, by
default C4.5 uses the information gain ratio to search for the best attribute to
form a test [16]. To force C4.5 to generate different trees using the same training
set, we modified C4.5 by stochastically restricting the set of attributes available
for selection at a decision node. This is implemented by using a probability
parameter P. At each decision node, an attribute subset is randomly selected with
each available attribute having the probability P of being selected. The available
attributes refer to those attributes that have positive gain values. After attribute
subset selection, the algorithm chooses the attribute with the highest gain
ratio to form a test for the decision node from the subset. The modified version
of C4.5 is called C4.5Sas (C4.5 Stochastic Attribute Selection). With P = 1,
C4.5Sas generates the same tree as C4.5.

² To make the algorithm efficient, this step is limited to 10 × T times.

Given C4.5Sas, the design of Sasc is very simple. C4.5Sas is invoked
T times to generate T different decision trees to form a committee. As in
Boosting, the first tree produced by Sasc is the same as the tree generated by
C4.5. The detailed description of C4.5Sas and Sasc can be found in [13].
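The stochastic restriction at a single decision node can be sketched as below, assuming that the gain and gain ratio of every attribute at the node are already known. The text does not say what happens when the random subset turns out to be empty; drawing one available attribute at random in that case is an assumption of this sketch, not a documented part of C4.5Sas.

```python
import random


def choose_test_attribute(gain, gain_ratio, p, rng):
    """Pick the test attribute for one decision node.

    gain and gain_ratio map attribute names to their values at this
    node; p is the selection probability P; rng is a random.Random.
    """
    # Only attributes with positive gain are available for selection.
    available = [a for a in gain if gain[a] > 0]
    if not available:
        return None  # no useful test at this node
    subset = [a for a in available if rng.random() < p]
    if not subset:
        # Assumed fallback for an empty subset (not specified in the text).
        subset = [rng.choice(available)]
    return max(subset, key=lambda a: gain_ratio[a])
```

With `p = 1` every available attribute is kept, so the choice reduces to the usual gain ratio criterion, which matches the remark that C4.5Sas with P = 1 builds the same tree as C4.5.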

2.4 SascB 

The combination strategy adopted in SascB [14] employs, when generating
decision trees, both the stochastic selection of attribute subsets of Sasc and
the adaptive modification of the distribution of the training set of Boosting.
SascB uses the same Boosting procedure as Boost except that C4.5Sas is
used instead of C4.5 as the base tree generator. As in Boost, the tree with an
error rate greater than 0.5 is discarded,³ while the tree with no errors on the
training set is kept.

SascB can be considered as introducing the stochastic attribute selection 
process into the generation of each tree in the Boosting process. It can also be 
thought of as adaptively modifying the distribution of the training set after the 
generation of each decision tree in the Sasc process. 

2.5 Decision Making in Boost, Bag, Sasc, and SascB 

At the classification stage, for a given example, all of Boost, Bag, Sasc, and 
SascB make the final prediction through committee voting. In this paper, a 
voting method that uses the probabilistic predictions produced by all committee 
members without voting weights is adopted. With this method, each decision tree 
returns a distribution over classes that the example belongs to. This is performed 
by tracing the example down to a leaf of the tree. The class distribution for the 
example is estimated using the proportion of the training examples of each class 
at the leaf, if the leaf is not empty. When the leaf contains no training examples, 
the four committee learning algorithms estimate the class distribution using 
the training examples at the parent node of the empty leaf. The decision tree 
committee members vote by summing up the class distributions provided by all 
trees. The class with the highest score (sum of probabilities) wins the voting, 
and serves as the predicted class for this example. 
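The unweighted probabilistic vote can be sketched in a few lines, assuming each tree has already returned its class distribution for the example as a dictionary; the function name is illustrative.

```python
def committee_vote(distributions):
    """Sum the class distributions of all trees and return the winner.

    distributions holds one dict per committee member, mapping each
    class to the probability that member assigns to it.
    """
    scores = {}
    for distribution in distributions:
        for cls, probability in distribution.items():
            scores[cls] = scores.get(cls, 0.0) + probability
    return max(scores, key=scores.get)


# One confident tree can outvote two weakly opposed trees.
prediction = committee_vote([
    {"pos": 0.9, "neg": 0.1},
    {"pos": 0.4, "neg": 0.6},
    {"pos": 0.4, "neg": 0.6},
])
```

In this example a categorical majority vote would choose "neg" (two trees against one), whereas the summed probabilities are 1.7 for "pos" and 1.3 for "neg", which shows how the probabilistic method uses the confidence of each member.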

³ This step is also limited to 10 × T times, where T is the number of Boosting trials.

SascMB(Att, D, P, S, N)

INPUT:  Att: a set of attributes,
        D: a training set represented using Att and classes,
        P: a probability value,
        S: the size of each subcommittee,
        N: the number of subcommittees.

OUTPUT: a committee, H, consisting of N subcommittees with each
        containing S trees.

Set T, the number of trials, = S × N
Set instance weight $w_1(x) = 1$ for each x in D
$H_1$ := C4.5Sas(Att, D, $w_1$, 1)
t := 2
WHILE (t <= T)
{   D' := training cases in D that are misclassified by $H_{t-1}$
    $\epsilon_{t-1} = \frac{1}{|D|} \sum_{x \in D'} w_{t-1}(x)$
    IF ($\epsilon_{t-1} > 1/2$)
        ...
    ELSE IF (t modulus S = 1)
        Reset $w_t(x)$ using bootstrap sampling, i.e., $w_t(x)$ is set at 0 and
        incremented 1 unit every time instance x is selected during
        uniformly sampling |D| instances from D with replacement
    ELSE IF ($\epsilon_{t-1} \neq 0$)
        Calculate $w_t(x)$, the weight of each x in D, from $w_{t-1}(x)$ using
        Equation 1 and renormalize these weights so that they sum to |D|
    $H_t$ := C4.5Sas(Att, D, $w_t$, P)
}
RETURN H

Fig. 1. The SascMB learning algorithm

There are three other approaches to voting: The categorical predictions with- 
out voting weights, the probabilistic predictions with voting weights, and the 
categorical predictions with voting weights [13]. These three voting methods 
perform either worse than or similarly to the method that we use here [13,14]. 

3 SascMB: Incorporating Bagging into SascB 

Figure 1 presents the details of the SascMB algorithm. It results from
incorporating Bagging into SascB. SascMB generates N subcommittees. This
process can be parallelized. Each subcommittee contains S decision trees built 
using the SascB procedure described in the previous section. The generation 
of the first subcommittee (or one of the subcommittees if using parallel or dis- 
tributed processing) starts from the initial training set D with each training 
instance having the weight 1. The first tree in this subcommittee is the same 
one as that built by C4.5 using the entire training set. The generation of every 
other subcommittee starts from a bootstrap sample of D. 
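
The two mechanical pieces of this procedure, the bootstrap weight reset and the rule that trial t opens a new subcommittee, can be sketched as follows. This is only an illustrative sketch: the function names `bootstrap_weights` and `starts_new_subcommittee` are hypothetical, and the tree learner itself (C4.5Sas) is not modelled.

```python
import random

def bootstrap_weights(n, rng):
    """Draw n indices uniformly with replacement; weight = times each index was drawn."""
    weights = [0] * n
    for _ in range(n):
        weights[rng.randrange(n)] += 1
    return weights

def starts_new_subcommittee(t, s):
    """True when 1-based trial t is the first tree of a subcommittee of size s
    (equivalent to 't modulus S = 1' for s > 1)."""
    return (t - 1) % s == 0

rng = random.Random(0)
w = bootstrap_weights(10, rng)
print(w, sum(w))  # the weights always sum to the number of instances

# With S = 5, the trials that open a subcommittee among the first 15 trials:
starts = [t for t in range(1, 16) if starts_new_subcommittee(t, 5)]
print(starts)
```

Because each subcommittee after the first begins from its own bootstrap sample, the subcommittees do not depend on one another, which is what makes the parallel generation mentioned above possible.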
At the classification stage, all the members of all the subcommittees
generated by SascMB vote to predict a class for a given instance. SascMB uses
the same default voting method as Boost, Sasc, and SascB, since it generally 
performs better than the other three voting approaches [13,14] as mentioned in 
Section 2.5. 
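
A minimal sketch of voting with probabilistic predictions and no voting weights is given below. It assumes that each tree returns a class distribution for the instance as a dictionary; the name `committee_vote` and the example numbers are hypothetical.

```python
def committee_vote(distributions):
    """Sum the per-tree class distributions and return the class with the largest total."""
    totals = {}
    for dist in distributions:
        for label, p in dist.items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=totals.get)

# Three trees: two lean slightly towards "b", one is confident about "a".
trees = [{"a": 0.9, "b": 0.1}, {"a": 0.4, "b": 0.6}, {"a": 0.45, "b": 0.55}]
print(committee_vote(trees))
```

In this example a categorical majority vote would pick "b" (two trees out of three), whereas summing the probabilities lets the confident tree carry more influence.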

4 Experiments 

In this section, we empirically evaluate SascMB to examine whether incorporat- 
ing Bagging into SascB can increase stability and average accuracy of learned 
committees. It is compared with other committee learning algorithms: SascB, 
Boost, and Sasc. In addition, a multiple Boosting algorithm MB [15] is also 
included in the comparison. C4.5, the base decision tree learning algorithm of all 
these committee learning algorithms, is used as the base line for the comparison. 
The results of Bag are not given in this paper due to the space limit. Note that 
the average error of Bag is higher than that of Boost in our experiments. 

MB is the same as SascMB except that it does not include the stochastic 
attribute selection component. In other words, MB uses the same procedure 
as SascMB for generating multiple decision tree committees, but the former 
uses C4.5 instead of C4.5 Sas. 

4.1 Experimental Domains and Methods 

Forty natural domains from the UCI machine learning repository [17] are used. 
They include all the domains used by Quinlan [4] for studying Boosting. 

In every domain, two stratified 10-fold cross-validations were carried out for 
each algorithm. The result reported for each algorithm in each domain is an 
average value over 20 trials. All the algorithms are run on the same training and 
test set partitions with their default option settings. Pruned trees are used for all 
the algorithms. Boost, Sasc, SascB, MB, and SascMB all use probabilistic
predictions (without voting weights) for voting to decide the final classification.
The number of trials (the parameter T) is set at 100 in the experiments for 
Boost, Sasc, and SascB. The subcommittee size and the number of subcom- 
mittees are set at 5 and 20 respectively, resulting in 100 trees in total for MB and 
SascMB. The probability of each attribute being selected into the subset (the 
parameter P) is set at the default, 33%, for Sasc, SascB, and SascMB. 

4.2 Results 

Table 1 shows the error rates of the six algorithms. To facilitate pairwise com- 
parisons among the six algorithms, error ratios are derived from Table 1 and 
presented in Table 2. An error ratio, for example for Boost vs C4.5, presents 
a result for Boost divided by the corresponding result for C4.5 - a value less 
than 1 indicates an improvement due to Boost. To compare the error rates of 
two algorithms in a domain, a two-tailed pairwise t-test on the error rates of 
the 20 trials is carried out. The difference is considered as significant, if the sig- 
nificance level of the t-test is better than 0.05. In Table 2, boldface (italic) font, 
for example for Boost vs C4.5, indicates that Boost is significantly more (less) 
accurate than C4.5. The last two rows in Table 2 present the numbers of wins,
ties, and losses between the error rates of the corresponding two algorithms in
the 40 domains, and the significance levels of a one-tailed pairwise sign-test on
these win/tie/loss records.

Table 1. Error rates (%)

Domain             C4.5    Boost   Sasc    SascB   MB      SascMB
Annealing          7.40    4.90    5.85    4.12    4.67    5.06
Audiology          21.39   15.41   18.73   15.19   15.88   15.43
Automobile         16.31   13.42   14.35   15.88   16.10   16.82
Breast (W)         5.08    3.22    3.44    3.08    3.08    3.15
Chess (KR-KP)      0.72    0.36    0.67    0.36    0.39    0.39
Chess (KR-KN)      8.89    3.54    9.26    4.09    5.27    5.63
Credit (Aust)      14.49   13.91   14.71   14.20   12.82   12.61
Credit (Ger)       29.40   25.45   25.10   25.15   23.90   23.50
Echocardiogram     37.80   36.24   37.01   39.20   31.68   30.47
Glass              33.62   21.09   25.27   21.99   24.33   21.31
Heart (C)          22.07   18.80   16.65   18.63   18.13   18.29
Heart (H)          21.09   21.25   18.88   21.09   19.20   18.53
Hepatitis          20.63   17.67   18.40   15.79   17.12   17.12
Horse colic        15.76   19.84   17.39   19.43   15.90   16.04
House votes 84     5.62    4.82    4.59    4.25    3.90    4.25
Hypo               0.46    0.32    0.46    0.36    0.33    0.40
Hypothyroid        0.71    1.14    0.76    0.98    0.82    0.95
Image              2.97    1.58    2.06    1.58    1.77    1.93
Iris               4.33    5.67    5.00    5.67    5.00    5.00
Labor              23.67   10.83   18.83   9.83    12.33   10.50
LED 24             36.50   32.75   29.00   32.50   32.00   30.50
Letter             12.16   2.95    3.74    2.76    3.45    3.32
Liver disorders    35.36   28.88   29.90   29.47   26.73   27.29
Lung cancer        57.50   53.75   45.83   53.75   47.08   49.17
Lymphography       21.88   16.86   18.48   16.50   16.86   14.76
NetTalk(Letter)    25.88   22.14   21.98   19.91   21.37   20.12
NetTalk(Ph)        18.97   16.01   18.03   14.60   15.22   14.73
NetTalk(Stress)    17.25   11.91   12.44   11.30   12.26   10.54
Pima               23.97   26.57   23.76   26.43   23.31   23.18
Postoperative      29.44   38.89   28.89   38.89   32.22   34.44
Primary tumor      59.59   55.75   54.72   55.02   55.02   55.30
Promoters          17.50   4.68    7.09    4.73    5.64    5.64
Sick               1.30    0.92    1.42    1.04    1.10    1.33
Solar flare        15.62   17.57   15.70   17.57   16.31   15.95
Sonar              26.43   14.64   16.32   13.93   19.68   17.79
Soybean            8.49    6.22    5.42    5.64    6.66    5.49
Splice junction    5.81    4.80    4.50    3.65    4.23    3.81
Vehicle            28.50   22.40   25.12   22.40   24.00   23.52
Waveform-21        23.83   18.33   19.83   17.50   18.00   17.67
Wine               8.96    3.35    4.48    1.96    3.07    1.68
average            19.18   15.97   16.10   15.76   15.42   15.09
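
The error ratio and the one-tailed sign-test on win/loss counts described above can be sketched as follows. The sketch assumes that ties are dropped before the test and that the p-value is the upper binomial tail under a fair coin; the paper does not spell out either detail, and the function names are hypothetical.

```python
from math import comb

def error_ratio(err_a, err_b):
    """Error rate of algorithm A divided by that of algorithm B; below 1 favours A."""
    return err_a / err_b

def sign_test_p(wins, losses):
    """One-tailed sign-test: P(X >= wins) for X ~ Binomial(wins + losses, 0.5).
    Ties are assumed to be excluded by the caller."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Boost vs C4.5 on Annealing, using the error rates of Table 1.
print(round(error_ratio(4.90, 7.40), 2))
# SascMB vs Boost: 27 wins, 0 ties, 13 losses.
print(round(sign_test_p(27, 13), 4))
```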

From Tables 1 and 2, we have the following observations. 

(1) Incorporating Bagging into SascB can reduce variability of learned commit- 
tees in terms of decreasing the frequency of producing significantly higher error 
rate than the base decision tree learning algorithm. 

While both Boost and SascB obtain significantly higher error rates 
than C4.5 in five out of the 40 domains, SascMB only has significantly higher er- 
ror rates than C4.5 in two domains. The highest relative error increase of Boost 
and SascB over C4.5 is 61% and 38% respectively. It is 34% for SascMB, the 
smallest one among the three algorithms. Note that Sasc and MB are more 
stable than SascMB, but they are less accurate than SascMB on average (see
below, for the discussion).

Table 2. Error rate ratios

               vs C4.5                                 SascMB vs
Domain         Boost   Sasc    SascB   MB      SascMB  Boost   Sasc    SascB   MB
Annealing      .66     .79     .56     .63     .68     1.03    .86     1.23    1.08
Audiology      .72     .88     .71     .74     .72     1.00    .82     1.02    .97
Automobile     .82     .88     .97     .99     1.03    1.25    1.17    1.06    1.04
Breast (W)     .63     .68     .61     .61     .62     .98     .92     1.02    1.02
KR-KP          .50     .93     .50     .54     .54     1.08    .58     1.08    1.00
KR-KN          .40     1.04    .46     .59     .63     1.59    .61     1.38    1.07
Credit(A)      .96     1.02    .98     .88     .87     .91     .86     .89     .98
Credit(G)      .87     .85     .86     .81     .80     .92     .94     .93     .98
Echo           .96     .98     1.04    .84     .81     .84     .82     .78     .96
Glass          .63     .75     .65     .72     .63     1.01    .84     .97     .88
Heart(C)       .85     .75     .84     .82     .83     .97     1.10    .98     1.01
Heart(H)       1.01    .90     1.00    .91     .88     .87     .98     .88     .97
Hepatitis      .86     .89     .77     .83     .83     .97     .93     1.08    1.00
Horse colic    1.26    1.10    1.23    1.01    1.02    .81     .92     .83     1.01
House votes    .86     .82     .76     .69     .76     .88     .93     1.00    1.09
Hypo           .70     1.00    .78     .72     .87     1.25    .87     1.11    1.21
Hypothyroid    1.61    1.07    1.38    1.15    1.34    .83     1.25    .97     1.16
Image          .53     .69     .53     .60     .65     1.22    .94     1.22    1.09
Iris           1.31    1.15    1.31    1.15    1.15    .88     1.00    .88     1.00
Labor          .46     .80     .42     .52     .44     .97     .56     1.07    .85
LED 24         .90     .79     .89     .88     .84     .93     1.05    .94     .95
Letter         .24     .31     .23     .28     .27     1.13    .89     1.20    .96
Liver          .82     .85     .83     .76     .77     .94     .91     .93     1.02
Lung cancer    .93     .80     .93     .82     .86     .91     1.07    .91     1.04
Lympho         .77     .84     .75     .77     .67     .88     .80     .89     .88
NetTalk(L)     .86     .85     .77     .83     .78     .91     .92     1.01    .94
NetTalk(P)     .84     .95     .77     .80     .78     .92     .82     1.01    .97
NetTalk(S)     .69     .72     .66     .71     .61     .88     .85     .93     .86
Pima           1.11    .99     1.10    .97     .97     .87     .98     .88     .99
Postoper       1.32    .98     1.32    1.09    1.17    .89     1.19    .89     1.07
Tumor          .94     .92     .92     .92     .93     .99     1.01    1.01    1.01
Promoters      .27     .41     .27     .32     .32     1.21    .80     1.19    1.00
Sick           .71     1.09    .80     .85     1.02    1.45    .94     1.28    1.21
Solar flare    1.12    1.01    1.12    1.04    1.02    .91     1.02    .91     .98
Sonar          .55     .62     .53     .74     .67     1.22    1.09    1.28    .90
Soybean        .73     .64     .66     .78     .65     .88     1.01    .97     .82
Splice         .83     .77     .63     .73     .66     .79     .85     1.04    .90
Vehicle        .79     .88     .79     .84     .83     1.05    .94     1.05    .98
Waveform       .77     .83     .73     .76     .74     .96     .89     1.01    .98
Wine           .37     .50     .22     .34     .19     .50     .37     .86     .55
average        .80     .84     .78     .78     .77     .99     .91     1.01    .98
w/t/l          33/0/7  32/1/7  32/1/7  35/0/5  33/0/7  27/0/13 29/1/10 19/1/20 21/4/15
p. of wtl      <.0001  <.0001  <.0001  <.0001  <.0001  .0192   .0017   .5000   .2025

(2) Incorporating Bagging into SascB can also reduce the average error rate of 
learned committees. SascMB outperforms Boost, Sasc, SascB, and MB in 
terms of lower error rate. 

All the five committee learning algorithms achieve significant error rate re- 
duction over C4.5 at a level better than 0.0001 using a one-tailed pairwise sign- 
test on the error rates of these algorithms in the 40 domains. Among them, 
SascMB obtains the lowest average error rate 15.09%. The average error rate is 
19.18%, 15.97%, 16.10%, 15.76%, and 15.42% for C4.5, Boost, Sasc, SascB, 
and MB respectively. SascMB also achieves the greatest average relative error 
reduction (23%) over C4.5 among these five committee learning algorithms. 
A direct comparison shows that the average relative error reduction of
SascMB over Boost and Sasc is 1% and 9% respectively. A one-tailed sign- 
test suggests that SascMB has significantly lower error rate than Boost and 
Sasc (p = 0.0192 and 0.0017 respectively). The average relative error reduction 
of SascMB over MB is 2%, but a one-tailed sign-test fails to show that this re- 
duction is significant at a level of 0.05. The average error ratio of SascMB over 
SascB is 1.01, although the average error rate of SascMB is lower than that 
of SascB. This is because SascB performs better than SascMB in domains in 
which they have relatively low error rates, and vice versa in domains in which 
they have relatively high error rates. It might be thought a disadvantage of 
SascMB that the average error ratio compared to SascB is greater than 1. 
However, we argue that this is a statistical anomaly, due to SascB’s superior 
performance when C4.5 has lower error rates. Increasing accuracy is as impor- 
tant as decreasing error. The average accuracy ratio, a measure that favors better 
performance at large error rates, of SascMB against SascB (an accuracy for 
SascMB divided by the corresponding accuracy for SascB) is also 1.01. Note 
that the average error rate of SascMB is 0.67 percentage points lower, a con- 
siderable reduction, than that of SascB. 
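
The effect discussed in this paragraph, an average error ratio above 1 alongside a lower average error rate, can be reproduced on just two of the 40 domains. The sketch below uses the SascB and SascMB error rates of Chess (KR-KN) and Echocardiogram from Table 1; restricting the computation to these two domains is purely for illustration.

```python
# Error rates (%) as (SascB, SascMB), copied from Table 1.
rates = {"Chess (KR-KN)": (4.09, 5.63), "Echocardiogram": (39.20, 30.47)}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

mean_error_b = mean(b for b, _ in rates.values())
mean_error_mb = mean(mb for _, mb in rates.values())
# Ratios are averaged per domain, so a low-error domain can dominate the error ratio.
mean_error_ratio = mean(mb / b for b, mb in rates.values())
mean_accuracy_ratio = mean((100 - mb) / (100 - b) for b, mb in rates.values())

print(mean_error_b, mean_error_mb)
print(mean_error_ratio, mean_accuracy_ratio)
```

SascMB has the lower mean error on these two domains, yet both its mean error ratio and its mean accuracy ratio against SascB exceed 1, because the error ratio is most sensitive where error rates are small and the accuracy ratio where they are large.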



5 Conclusions 

We have presented a new classifier committee learning method, SascMB, for 
decision tree learning. It generates multiple committees through incorporating 
Bagging into SascB. In the new algorithm, the Boosting process is broken down 
into several small processes with each creating one subcommittee. The Bagging 
component of SascMB further increases the diversity and independence of com- 
mittee members. Our aim is to improve the stability and average accuracy of 
learned committees. Another advantage of SascMB over SascB and Boosting 
is that SascMB is amenable to parallel and distributed processing, which is 
important for datamining in large datasets. 

The results of experiments with a representative collection of natural domains 
suggest that SascMB is more stable than SascB and Boosting. It achieves the 
lowest error rate among the five committee learning algorithms on average in 
the 40 domains under investigation. It also achieves the greatest average relative 
error reduction over the base decision tree learning algorithm among the five 
committee learning algorithms. The experiments show that SascMB can signif- 
icantly outperform Sasc and Boosting on average in terms of lower error rate. At 
the very least, SascMB is as accurate as SascB and MB, while demonstrating 
greater stability and amenability to parallel and distributed processing. 
Abstract. An attribute is deemed important in data mining if it partitions
the database such that previously unknown regularities are observable.
Many information-theoretic measures have been applied to quantify
the importance of an attribute. In this paper, we summarize and criti- 
cally analyze these measures. 



1 Introduction 

Watanabe [21] suggested that pattern recognition is essentially a conceptual 
adaptation to the empirical data in order to see a form in them. The form is 
interpreted as a structure which always entails small entropy values. Many of the 
algorithms in pattern recognition may be characterized as efforts to minimize 
entropy [20] . The philosophy of entropy minimization in pattern recognition can 
be applied to related fields, such as classification, data analysis, machine learning, 
and data mining, where one of the tasks is to discover patterns or regularities 
in a large data set. Regularities and structureness are characterized by small 
entropy values, whereas randomness is characterized by large entropy values. 

One may partition the statistical population into smaller populations using 
the values taken by an attribute. Such an attribute is deemed important for 
data mining if regularities are observable in the smaller populations, while be- 
ing unobservable in the statistical population. In other words, if an attribute is 
used for data mining, then the attribute should lead to entropy reduction. The 
well known ID3 inductive learning algorithm [16] uses exactly such a measure
for attribute selection in a learning process. Based on the philosophy of en- 
tropy minimization, this paper examines information-theoretic measures [2,18] 
for evaluating attribute importance in data mining. 

2 Measuring Attribute Importance 

Let X denote a discrete random variable and $x_i$ a value in the domain of X.
A joint probability distribution is a real-valued function $P_X$ over X such that
$0 \le P_X(x_i) \le 1$ and $\sum_{i=1}^{n} P_X(x_i) = 1$, where n denotes the number of elements
in the domain of X. We write $P_X$ as P if X is understood. Shannon's entropy
function H is defined over P as:

$$H(P) = -\sum_{i=1}^{n} P(x_i) \log P(x_i),$$

N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 133—137, 1999. 
where $P(x_i) \log P(x_i) = 0$ if $P(x_i) = 0$. We say Shannon's entropy is over X and
write H(P) as H(X) when the distribution P over X is understood. Shannon's
entropy is a nonnegative function, i.e., $H(X) \ge 0$. It reaches the maximum
value $\log n$ when P is the uniform distribution, i.e., $P(x_1) = \ldots = P(x_n) = 1/n$.
The minimum entropy value 0 is obtained when the distribution P focuses on a
particular value $x_j$, i.e., $P(x_j) = 1$ and $P(x_i) = 0$, $1 \le i \le n$, $i \ne j$.

The conditional entropy, i.e., the difference between joint and marginal
entropies, is given by:

$$H(X \mid Y) = H(X,Y) - H(Y).$$

Mutual information can be defined as:

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y).$$

That is, the mutual information measures the decrease of uncertainty about X
caused by the knowledge of Y, and vice versa. It is a measure of the amount of
information about X contained in Y. This measure is the same as the amount
of information about Y contained in X, namely, I(X;Y) = I(Y;X). Furthermore,
the amount of information contained in X about itself is obviously H(X),
namely, I(X;X) = H(X).
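
The definitions above translate directly into a few lines of code. The sketch below assumes a joint distribution stored as a dictionary from (x, y) pairs to probabilities and uses base-2 logarithms; both are choices made here for illustration, since the text leaves the base of the logarithm open.

```python
from math import log2

def entropy(probs):
    """Shannon entropy, with 0 log 0 taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

def marginals(joint):
    """joint: dict mapping (x, y) -> probability. Returns the marginals of X and Y."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def conditional_entropy(joint):
    """H(X | Y) = H(X, Y) - H(Y)."""
    _, py = marginals(joint)
    return entropy(joint.values()) - entropy(py.values())

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    px, py = marginals(joint)
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
identical = {(0, 0): 0.5, (1, 1): 0.5}  # Y is a copy of X

print(mutual_information(independent))  # no shared information
print(mutual_information(identical))    # equals H(X), as in I(X;X) = H(X)
```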

One may view an attribute and a database as a statistical variable taking val- 
ues from its domain and a statistical population, respectively [5,14]. Information- 
theoretic measures quantify relationships between random variables. They can 
immediately be applied for the analysis of databases and the evaluation of the 
usefulness of attributes in data mining [14]. 

One of the main tasks in knowledge discovery and data mining (KDD) is to 
find important relationships, or associations, between attributes. In statistical 
terms, two attributes are associated if they are not independent [11]. Two
attributes are independent if changing the value of one does not affect the value
of the other. From this standpoint, we comment on the meaning of information- 
theoretic measures in the context of data mining. 

For an attribute (or a set of attributes) X, the entropy value H(X) indicates
the information uncertainty associated with X. An attribute with a very
large domain normally divides the database into more, and smaller, classes than an
attribute with a small domain. A regularity found in a very small portion of the
database may not necessarily be useful. On the other hand, an attribute with a
small domain usually divides the database into a few larger classes. One may
not find regularities in such large subsets of the database. Entropy values may 
be used to control the selection of attributes. It is expected that an attribute 
with middle range entropy values may be more useful. Similar ideas have been 
used successfully in information retrieval [22]. A high frequency term tends to 
have a higher entropy value, and a lower frequency term tends to have a lower 
entropy value. Both may not be good index terms. The middle frequency terms 
tend to be useful in describing documents in a collection. 

The conditional entropy H(Y|X) measures the degree of one-way implication
or functional dependency of the sets of attributes X and Y. If the functional
dependency X → Y holds, we conclude that $P(y_j \mid x_i)$ is either 1 or 0. In
terms of conditional entropy, X → Y holds if and only if H(Y|X) = 0 [10,14].
By the relationships between entropy, conditional entropy, and mutual information,
the above condition can be equivalently stated as H(X) = H(X,Y) or
I(X;Y) = H(Y) [10]. If Y is dependent on X, the partition of the database
by X and Y is exactly the same as the one produced by X alone. The former
condition reflects this observation. The latter condition shows that the mutual
information between X and Y is the same as the self-information of Y. The
conditional entropy function can be used to measure the importance of attributes for
discovering one-way associations. For a fixed Y, one obvious disadvantage of using
H(Y|X) is that it favours attributes with large domains, namely, attributes
with high entropy values [16].
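
The criterion H(Y|X) = 0 for a functional dependency, and the bias towards large domains, can both be seen on a toy relation. In the sketch below probabilities are estimated by relative frequencies; the table, its column names, and the helper names are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy_of(rows, cols):
    """Entropy of the joint values of the given columns, from relative frequencies."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())

def cond_entropy(rows, y_cols, x_cols):
    """H(Y | X) = H(X, Y) - H(X)."""
    return entropy_of(rows, x_cols + y_cols) - entropy_of(rows, x_cols)

data = [("A1", "North", "red"), ("A1", "North", "blue"), ("B2", "South", "red"),
        ("B2", "South", "red"), ("C3", "North", "blue"), ("C3", "North", "red")]
rows = [{"id": i, "zip": z, "city": c, "colour": col}
        for i, (z, c, col) in enumerate(data)]

print(cond_entropy(rows, ["city"], ["zip"]))    # zip -> city holds
print(cond_entropy(rows, ["colour"], ["zip"]))  # zip does not determine colour
print(cond_entropy(rows, ["colour"], ["id"]))   # a unique id "determines" everything
```

The last line illustrates the disadvantage noted above: an attribute with one value per row drives H(Y|X) to 0 for any Y, although it captures no useful regularity.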

Mutual information measures the degree of deviation of a joint distribution
from the independence distribution [21]. It may be used to evaluate the usefulness
of attributes in finding two-way associations. With a fixed Y, the use of I(X;Y)
for finding a two-way association is in fact the same as using H(Y|X) for finding
a one-way association [13,19]. Two sets of attributes X and Y are statistically
independent if I(X;Y) = 0. Equivalently, we can state this condition as H(X) =
H(X|Y), H(Y) = H(Y|X), or H(X,Y) = H(X) + H(Y). If X and Y are
independent, one cannot use values of X to predict the values of Y, and vice
versa. In information-theoretic terms, knowing the value of Y does not reduce
our uncertainty about X, and vice versa.

Conditional entropy and mutual information serve as the basic quantities 
for measuring attribute associations. By combination and normalization, one 
may obtain many information-theoretic measures of attribute importance. In 
summary, the following three groups can be obtained: 

- Lee [10], Malvestuto [14], Pawlak et al. [15]: H(X|Y), H(Y|X);
  Kvalseth [9], Malvestuto [14], Quinlan [16]: I(X;Y)/H(X), I(X;Y)/H(Y).

- Knobbe and Adriaans [8], Linfoot [12], Quinlan [16]: I(X;Y);
  Malvestuto [14]: I(X;Y)/H(X,Y); Kvalseth [9]: 2I(X;Y)/(H(X) + H(Y));
  Horibe [4], Kvalseth [9]: I(X;Y)/max(H(X), H(Y));
  Kvalseth [9]: I(X;Y)/min(H(X), H(Y)).

- Lopez de Mantaras [13], Wan and Wong [19]: H(X|Y) + H(Y|X);
  Lopez de Mantaras [13], Rajski [17]: (H(X|Y) + H(Y|X))/H(X,Y).

Measures in the first group are asymmetric while measures in the other two 
groups are symmetric. Measures in the third group are distance measures. One 
can obtain the following relationships between these measures: 



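$$\text{(i)}\quad \frac{I(X;Y)}{H(X)} = 1 - \frac{H(X \mid Y)}{H(X)},$$

$$\text{(ii)}\quad H(X \mid Y) + H(Y \mid X) = H(X,Y) - I(X;Y),$$

$$\text{(iii)}\quad \frac{2I(X;Y)}{H(X)+H(Y)} = 2\left(1 - \frac{H(X,Y)}{H(X)+H(Y)}\right),$$

$$\text{(iv)}\quad \frac{H(X \mid Y)+H(Y \mid X)}{H(X,Y)} = 1 - \frac{I(X;Y)}{H(X,Y)}.$$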



They provide additional support for various measures. Furthermore, measures 
of one-way association can be expressed in a general form as different normaliza- 
tions of conditional entropy, while measures of two-way association as different 
normalizations of mutual information [9] . 
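
As a numerical check, one representative measure from each group can be computed on a small joint distribution and compared against the corresponding relationship. The distribution below is made up for illustration, and base-2 logarithms are an arbitrary choice.

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# A hypothetical joint distribution over X in {a, b} and Y in {0, 1}.
joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.3}
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

hx, hy, hxy = H(px.values()), H(py.values()), H(joint.values())
mi = hx + hy - hxy
hx_given_y, hy_given_x = hxy - hy, hxy - hx

asymmetric = mi / hx                            # first group: I(X;Y)/H(X)
symmetric = 2 * mi / (hx + hy)                  # second group: 2I(X;Y)/(H(X)+H(Y))
distance = (hx_given_y + hy_given_x) / hxy      # third group: normalized distance

print(asymmetric, 1 - hx_given_y / hx)
print(symmetric, 2 * (1 - hxy / (hx + hy)))
print(distance, 1 - mi / hxy)
```

Each printed pair agrees up to floating-point rounding, which is what the relationships state: the asymmetric measure is a normalization of conditional entropy, and the symmetric and distance measures are normalizations of mutual information.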

In studying the main problems for KDD, Klosgen [7] discussed two types of
problems, namely, classification and prediction, and summary and description.
Kamber and Shinghal [6] referred to them as the discovery of discriminant and
characteristic rules, respectively. The classification and prediction problem deals
with the discovery of a set of rules or similar patterns for predicting the values
of a dependent variable. The ID3 algorithm [16] and the mining of association
rules [1] are examples for solving this type of problem. The summary and
description problem involves the discovery of a dominant structure that derives a
dependency. It is important to note that asymmetric measures may be suitable
for the former problem, while symmetric measures may be appropriate for the latter.

In the study of association of random variables using statistical measures, 
Liebetrau [11] pointed out that many symmetric measures do not tell us any- 
thing about causality. When two attributes are shown to be correlated, it is 
very tempting to infer a cause-and-effect relationship between them. It is very 
important to realize that the mere identification of association does not provide 
grounds to establish causality. Garner and McGill [3] showed that information 
analysis is very similar to analysis of variance. One may then extend the argu- 
ment of Liebetrau [11] to information-theoretic measures. In order to establish 
causality, we need additional techniques in data mining. 



3 Conclusion 

This preliminary study has demonstrated that asymmetric measures quantify 
one-way association and are typically related to conditional entropy, while sym- 
metric measures quantify two-way association and are typically related to mutual 
information. If information theory is to be used to develop a formal theory for 
knowledge discovery and data mining, then the principle of entropy reduction 
and models in which causality can be established [11] warrant more attention. 
ABSTRACT

Knowledge discovery from raw data is very important to non-experts and even to experts who have
difficulty expressing their skills in machine-interpretable forms. However, real world data
often contain some redundant or unnecessary features, and if they are used directly, the quality
of the discovered knowledge may be much degraded. Here, a new technique of dynamically
selecting features is suggested. In contrast to static feature selection, this scheme selects each
new feature based on its correlation with the previously selected features. In addition, this
scheme does not require setting any threshold, which would be too difficult to decide.
Experiments have been conducted for some real world domains in terms of tree sizes and test
data error rates. The results show the soundness of this scheme.

Keyword : machine learning, feature selection, mutual information, data mining 

1. Introduction 

In the past two decades, techniques of knowledge acquisition from raw data have 
been studied. The most notable ones are decision tree approaches like ID3[1] and 
C4.5[2], back propagation methods[3], INDUCE system[4], and logic minimization 
based approaches like R-MINI[5] and R-ESPRESSO[6]. However, the given data 
have been directly used by those systems without any preprocessing. 

The raw data used for inductive learning often contain features irrelevant to the 
decision of the classes to which the data belong. Some features even worsen the 
process of learning in that they sometimes enormously increase the learning time, and 
produce more rules and less accurate recognition rates. For this reason, it is important
to select the subset of the features of the given data such that the performance of the 
learning is optimal. The optimality may be judged by the combination of many 
factors including the learning time, the size of the final knowledge, and the 
recognition rates of data, especially of unseen data. 

2. Previous Works 

The problem FRn-k (Feature Reduction n-k) is to select the optimal k (< n) out of n
features. However, the optimal subset selection time is on the order of nCk, which is
impractical for large values of k and n.
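
To give a feeling for how quickly nCk grows, the following lines count the candidate subsets for two feature counts that appear in the experiments of this paper (24 and 36 features); the subset sizes 7 and 18 are picked here only as examples.

```python
from math import comb

# Number of distinct k-feature subsets an exhaustive search would have to evaluate.
subsets_24_7 = comb(24, 7)
subsets_36_18 = comb(36, 18)
print(subsets_24_7, subsets_36_18)
```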

Information-theoretic measures[7] may be applied for the selection of features. 
The mutual information I(C; F) of C and F is measured by how much the uncertainty 
decreases after the value of F gets known. Statically, the k features with the
largest mutual information may be selected.

Battiti's algorithm[8], MIFS, calculates for each candidate feature F its mutual
information with the class feature C, and the average of its mutual information with
each of the already selected features. After scaling the average, the difference
between the two values is used as the selection measure.

That is, for each candidate feature F,

$$I(C;F) - \beta \sum_{F' \in S} I(F;F') \,/\, |S|$$

is calculated, and the feature with the maximum value is selected. This corresponds to
the balancing of static and dynamic characteristics. The problem is that the value of
the balancing factor β is too hard to decide.
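
The selection measure described above can be sketched as follows, with mutual information estimated from relative frequencies. The sketch follows the averaged form given in the text; the function names and the tiny data set are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_info(xs, ys):
    """I(X;Y) estimated from two equally long lists of discrete values."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((k / n) * log2(k * n / (cx[x] * cy[y])) for (x, y), k in cxy.items())

def mifs_score(cls, candidate, selected, beta):
    """I(C;F) minus beta times the mean of I(F;F') over the selected features F'."""
    relevance = mutual_info(cls, candidate)
    if not selected:
        return relevance
    redundancy = sum(mutual_info(candidate, s) for s in selected) / len(selected)
    return relevance - beta * redundancy

cls = [0, 0, 1, 1]
f1 = [0, 0, 1, 1]        # already selected, fully informative
f1_copy = [0, 0, 1, 1]   # candidate that duplicates f1

print(mifs_score(cls, f1_copy, [f1], beta=0.0))  # static view: looks perfect
print(mifs_score(cls, f1_copy, [f1], beta=1.0))  # redundancy cancels the relevance
```

The outcome for the duplicated feature depends entirely on β, which is the difficulty the text points out.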

3. FGMIFS(Feature Group - Mutual Information based 
Feature Selection) 

3.1 Flattened feature and Feature group mutual information 

The values of the flattened feature (F, S) belong to $F \times S_1 \times S_2 \times \ldots \times S_k$, where F is
the feature the domain of which is F, and S is the set of k features the domains of
which are $S_1, S_2, \ldots, S_k$, respectively.

For example, assume that a new feature (i.e., F) is length the values of which are 
long and short, and two features shape, and color are already selected. Assume further 
the values of shape are rectangle and triangle, and the values of color are red and 
yellow. Then, (rectangle, red, long), (triangle, yellow, short), (triangle, red, long) and 
so on are the values of the flattened feature. Only the values existing in the training 
data are used. 
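
The shape, color, and length example can be written out as a short sketch. The four training rows and the helper name `flatten` are made up for illustration.

```python
def flatten(rows, selected, candidate):
    """One tuple per training row: values of the selected features, then the candidate."""
    return [tuple(row[f] for f in selected) + (row[candidate],) for row in rows]

rows = [
    {"shape": "rectangle", "color": "red", "length": "long"},
    {"shape": "triangle", "color": "yellow", "length": "short"},
    {"shape": "triangle", "color": "red", "length": "long"},
    {"shape": "rectangle", "color": "red", "length": "long"},
]

values = flatten(rows, ["shape", "color"], "length")
distinct = sorted(set(values))
print(distinct)
```

Only three of the 2 × 2 × 2 = 8 possible combinations occur in these rows, so the flattened feature has three values here.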

The new conditional entropy of the class C given the new flattened feature (F, S) is
given by equation 2. Here, S is the set of already selected features, and F is an
unselected (or candidate) feature to be tested for a potential selection.

$$H(C) = -\sum_{c=1}^{N_c} P(c) \log P(c) \qquad (1)$$

$$H(C \mid (F,S)) = -\sum_{s' \in S'} P(s') \Big( \sum_{c=1}^{N_c} P(c \mid s') \log P(c \mid s') \Big) \qquad (2)$$

where S' = the set of feature values of (F, S).

Finally, the mutual information GI(C; (F,S)) of the flattened feature (F,S) is defined
as the difference between H(C) and H(C|(F,S)), and called the feature group mutual
information between the feature F and the group S of features. Therefore, the feature
F is selected among all candidate features such that GI(C; (F,S)) in equation 3 is
maximum.

$$GI(C;(F,S)) = H(C) - H(C \mid (F,S)) \qquad (3)$$
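
Equations 1 to 3 can be sketched directly, with probabilities estimated by relative frequencies in the training data. The function names and the XOR-style example are hypothetical and serve only to show what the group measure captures.

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """Equation 1, estimated from a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def group_mutual_info(labels, flattened):
    """Equation 3: H(C) - H(C|(F,S)); `flattened` holds one value tuple per row."""
    n = len(labels)
    groups = {}
    for value, label in zip(flattened, labels):
        groups.setdefault(value, []).append(label)
    # Equation 2: class entropy inside each flattened value, weighted by its frequency.
    conditional = sum(len(g) / n * class_entropy(g) for g in groups.values())
    return class_entropy(labels) - conditional

labels = [0, 1, 1, 0]            # class is the XOR of the two features
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]

print(group_mutual_info(labels, list(zip(f1))))      # f1 alone tells nothing
print(group_mutual_info(labels, list(zip(f1, f2))))  # the flattened pair tells everything
```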
3.2 Selection of the first feature

Mutual information has a strong bias towards features with many values. For
example, a resident id, which is not useful for disease classification, has so many values
that it will surely produce the maximum mutual information and be selected
first. This would greatly degrade the quality of the feature selection, and tend to produce
unnecessarily many rules. For that reason, the first feature is selected using the gain
ratio[9], which is defined as the mutual information divided by the split information.
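
The resident id example can be made concrete with a small sketch. It assumes the usual definition of split information as the entropy of the feature itself; the data and the function names are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def mutual_info(cls, feature):
    return entropy(cls) + entropy(feature) - entropy(list(zip(cls, feature)))

def gain_ratio(cls, feature):
    """Mutual information divided by the split information (entropy of the feature)."""
    split_info = entropy(feature)
    return mutual_info(cls, feature) / split_info if split_info > 0 else 0.0

disease = [0, 0, 0, 0, 1, 1, 1, 1]
resident_id = list(range(8))          # unique per patient, useless for prediction
symptom = [0, 0, 0, 1, 1, 1, 1, 1]    # informative, with one disagreeing case

print(mutual_info(disease, resident_id), mutual_info(disease, symptom))
print(gain_ratio(disease, resident_id), gain_ratio(disease, symptom))
```

Mutual information ranks the id first, while the gain ratio ranks the symptom first because the id pays a large split-information penalty.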

3.3 FGMIFS algorithm 

FGMIFS selects the first feature as the one with the largest gain ratio. Afterwards, the
next feature to select is the candidate with the maximum group mutual information, where
the group mutual information is computed for the flattened feature formed by the candidate
and the already selected features. The process is repeated until k features are selected.
This dynamic feature selection based on the flattened feature concept does not try to
balance the static and dynamic characteristics because it already considers the relation
with all the selected features. The concrete algorithm is as follows.



Step 1:  FS ← initial set of n features;
         S ← { };
Step 2:  for each feature F ∈ FS,
             Compute I(C;F) / SI(F);
Step 3:  Choose F from FS
             that maximizes I(C;F) / SI(F);
         FS ← FS − {F};
         S ← S ∪ {F};
Step 4:  Repeat until |S| = k
             Choose F from FS
                 that maximizes GI(C; (F,S));
             FS ← FS − {F};
             S ← S ∪ {F};

FGMIFS Algorithm
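
The four steps can be sketched in Python as follows. This is not the authors' implementation: features are assumed to be given as columns of discrete values, probabilities are estimated by relative frequencies, SI(F) is taken as the entropy of F, and all names and data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def mutual_info(a, b):
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def gain_ratio(cls, feature):
    split_info = entropy(feature)
    return mutual_info(cls, feature) / split_info if split_info > 0 else 0.0

def fgmifs(features, cls, k):
    """features: dict name -> list of values (one per row); cls: list of class labels."""
    remaining = dict(features)            # Step 1: FS <- all features, S <- {}
    selected = []
    # Steps 2-3: the first feature is the one with the largest gain ratio.
    first = max(remaining, key=lambda f: gain_ratio(cls, remaining[f]))
    selected.append(first)
    del remaining[first]
    # Step 4: add the candidate whose flattened feature (F, S) maximizes GI(C;(F,S)).
    while len(selected) < k and remaining:
        def group_info(f):
            flattened = list(zip(*(features[s] for s in selected), remaining[f]))
            return mutual_info(cls, flattened)   # equals H(C) - H(C|(F,S))
        best = max(remaining, key=group_info)
        selected.append(best)
        del remaining[best]
    return selected

cls = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "f_hi": [0, 0, 0, 0, 1, 1, 1, 1],
    "f_hi_copy": [0, 0, 0, 0, 1, 1, 1, 1],   # redundant duplicate of f_hi
    "f_lo": [0, 0, 1, 1, 0, 0, 1, 0],        # complementary, slightly noisy
}
result = fgmifs(features, cls, 2)
print(result)
```

In this example the duplicate adds nothing once its twin is selected, so the second pick is the complementary feature, and no balancing factor had to be chosen.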

4. Experiments 

The FGMIFS algorithm was compared with Battiti's MIFS algorithm and the static
algorithm, based on the error rates by the C4.5rules system and the tree sizes by the C4.5
system. The MIFS algorithm was applied with the threshold β = 1. The static
algorithm corresponds to MIFS with β = 0. Some results are below.

4.1 LED display data 

The 2,000 training data with 10% error for each feature(i.e., bit) and 1,000 test data 
were used in the 10-class LED display domain[10]. 17 out of 24 features are 
irrelevant ones filled with random numbers. <Fig. 1> shows that the set of about 7
features with small tree sizes and small error rates can be successfully selected.

Fig. 1 Error rates by C4.5 rules for LED display test data

4.2 Mushroom data 

In the 24-feature/2-class mushroom domain [10], 2,822 training data and 2,822 test
data without any missing feature values were used. FGMIFS consistently gives the best results.



(Legend: STATIC, MIFS, FG-MIFS)

Fig. 2 Error rates by C4.5 for mushroom test data after each algorithm is used.

4.3 KRKPA7 data 

KRKPA7 data [10] contain 36 features and two classes, win and nowin. For this
experiment, 2,338 training data and 584 test data were used.

In this domain, FGMIFS works better than MIFS in terms of the recognition rates.
However, FGMIFS produces bigger trees. The reason may be that MIFS prunes
even necessary branches, resulting in lower recognition rates.

5. Conclusion and Future research 

Through experiments, FGMIFS has been shown to be a very effective feature
selection scheme compared with the static algorithm or Battiti's dynamic algorithm.




Fig. 3 Error rates by C4.5rules for krkpa7 test data after each algorithm is used.

This scheme utilizes the concept of the flattened attribute and eliminates setting the 
threshold, resulting in a stable performance. 

The automatic selection of an optimal k may be applied on top of this scheme, and
preliminary results already show the usefulness of FGMIFS. The requirement of
flattening all the selected features may also be relaxed, so that only some subset of the
selected features is flattened or the selected features are grouped.

Abstract. One approach to generating a good decision tree is to pre-process
the data to improve its description. There has been much research
on data pre-processing, such as attribute generation and attribute
selection methods. However, most of these methods are based on logic programming
and therefore require long run times. Additionally, some of them need a priori
knowledge. These are disadvantages for data mining. We propose a
novel data-driven approach in which knowledge on the relevance of attributes
is generated from the data as association rules, so a priori knowledge
is not necessary. In this paper, we present the method and clarify its
features. The effectiveness of our method for data mining is evaluated
through experiments.



1 Introduction 

In general, it is difficult to generate a good decision tree, that is, one that is small
and has high prediction accuracy. The reason is that the data do not always include
adequate attributes and values. This is why it is necessary to pre-process
the data, i.e., to add new attributes to the data and improve its description.
Much research has been done on such data pre-processing for decision trees.
Nevertheless, many of the proposed pre-processing methods are based on logic
programming. Moreover, they need a priori knowledge on the attributes and their
relevance [Lavrac]. These features are disadvantages for a data mining method.

In this paper, we propose a novel data pre-processing method for decision
trees. In this method, the knowledge on attributes is generated from the data,
so that the analysts need no a priori knowledge on the data and its attributes.
Additionally, the method employs the apriori algorithm, which works fast even if
the data size is large [Agrawal], and thus succeeds in keeping the run time short.

2 The Proposed Method 

Our proposed method is composed of five parts: (1) data transformation, (2)
generating association rules of attributes, (3) generating the candidates of new
attributes, (4) evaluation of the candidates, and (5) improving the data description.

Data transformation: First, we transform the data description of the training
data $\mathcal{D}$ into transaction data $\mathcal{D}_N$. Each datum $data_i$ in $\mathcal{D}$ consists of a set of
attribute values $v_{i,1}, \ldots, v_{i,n}$ for the explanation attributes $A_1, \ldots, A_n$ and a class $c_i$
for the class attribute $C$, as follows.

$data_i = \langle v_{i,1}, \ldots, v_{i,n}, c_i \rangle$ (1)

A transaction $trans_i$ in $\mathcal{D}_N$ is composed of items such as "$A_j = v_{i,j}$" and
"$C = c_i$", where each item is a pair of an attribute and its value in $\mathcal{D}$, as follows.

$trans_i = \langle A_1 = v_{i,1}, \ldots, A_n = v_{i,n}, C = c_i \rangle$ (2)
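
The transformation can be illustrated with a short Python sketch. It assumes each training datum is a tuple whose last element is the class; the attribute names, the `"name=value"` item encoding, and the function name are hypothetical choices made for this example only.

```python
def to_transaction(row, attribute_names, class_name="C"):
    """Turn <v_1, ..., v_n, c> into the item set {A_1=v_1, ..., A_n=v_n, C=c}."""
    *values, label = row
    items = {f"{name}={value}" for name, value in zip(attribute_names, values)}
    items.add(f"{class_name}={label}")
    return frozenset(items)


# Two made-up training data with attributes A1, A2 and a class.
rows = [("sunny", "high", "no"), ("rainy", "low", "yes")]
transactions = [to_transaction(r, ["A1", "A2"]) for r in rows]
```

Each transaction is then an ordinary item set, so a standard frequent-itemset miner can be run on it unchanged.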

Generating association rules of attributes: Second, the knowledge of the
relevance between attributes and class is generated as association rules from the
transaction data. In this method, we need association rules that satisfy the following two
conditions: (1) the condition part does not include an item corresponding to the class
attribute, and (2) the consequent part consists of a single item corresponding to the class.
Such an association rule is described as follows.

Rule $R_i$ : if $A_{i_1} = v_{i_1}, A_{i_2} = v_{i_2}, \ldots$ then $C = c_i$

support value : $sup(R_i)$, confidence value : $conf(R_i)$ (3)

The support value of an association rule is the ratio of the number
of data that the condition part of the rule matches to the number of data in the
data set $\mathcal{D}_N$. The confidence value of an association rule is the ratio of
the number of data that both the condition part and the consequent part of the rule
match to the number of data that the condition part matches. The
association rule $R_i$ implies that, in many cases, the class of a datum is the one in the
consequent part if the datum includes the attribute values in the condition part.
For the association rule generation process, the proposed method employs the apriori
algorithm, which generates rules fast even if the data size is large. This makes the
run time of the proposed method short enough for a data mining method.
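
As a hedged illustration of these two measures, the sketch below counts them directly on a list of transactions. It follows the reading used in this section, where support is the fraction of data matched by the condition part alone; the item strings and the function name are hypothetical, and a real system would take these counts from the apriori run instead of rescanning the data.

```python
def rule_stats(transactions, condition, consequent):
    """Return (support, confidence) of the rule `condition -> consequent`.

    support    = |data matching the condition| / |all data|
    confidence = |data matching condition and consequent| / |data matching the condition|
    """
    condition = frozenset(condition)
    matched = [t for t in transactions if condition <= t]
    if not transactions or not matched:
        return 0.0, 0.0
    support = len(matched) / len(transactions)
    confidence = sum(consequent in t for t in matched) / len(matched)
    return support, confidence


# Four made-up transactions over one attribute and a class.
data = [
    frozenset({"A1=a", "C=yes"}),
    frozenset({"A1=a", "C=yes"}),
    frozenset({"A1=a", "C=no"}),
    frozenset({"A1=b", "C=no"}),
]
sup, conf = rule_stats(data, {"A1=a"}, "C=yes")
```

Here three of the four data match the condition and two of those three carry the class `yes`.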

Generating candidates of new attributes: Third, the candidates of new
attributes and their attribute values are generated from the set of association
rules $\mathcal{R} = \{R_1, \ldots, R_m\}$. At first, each rule $R_i$ is a candidate for a new
attribute. The candidate generated from the rule $R_i$ is defined as follows.

Candidate of New Attribute : $AN_i$

Attribute : $A = \langle A_{i_1}, A_{i_2}, \ldots \rangle$

Attribute Value : $V = \{V_1, V_2\}$

$V_1 = \langle v_{i_1}, v_{i_2}, \ldots \rangle$

$V_2$ = any pattern other than $V_1$ (4)

The attribute value $V_1$ is defined as the vector pattern of attribute values in
the condition part of the association rule, i.e., $V_1 = \langle v_{i_1}, v_{i_2}, \ldots \rangle$.
Additionally, the attribute value $V_2$ is added for attribute value patterns that do not
match any attribute value $V_1$ from the association rules. Candidates $AN_i$
and $AN_j$ that include the same attributes $A$ are merged into one candidate of a
new attribute. The attribute value set of the merged candidate is the union of
the attribute value sets $V$ of $AN_i$ and $AN_j$. In this way, the set of candidates of
new attributes $\mathcal{AN} = \{AN_1, AN_2, \ldots\}$ is generated from $\mathcal{R}$.
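
The merging step can be sketched as follows. This is only an illustration: rules are represented by their condition parts as dictionaries, the placeholder `OTHER` stands in for the extra value $V_2$, and all names are hypothetical.

```python
OTHER = "other"  # stand-in for V2: any pattern not named by a rule


def build_candidates(rule_conditions):
    """Group rule conditions by attribute set and take the union of their patterns."""
    candidates = {}
    for condition in rule_conditions:  # condition: dict attribute -> value
        attrs = tuple(sorted(condition))
        pattern = tuple(condition[a] for a in attrs)
        candidates.setdefault(attrs, set()).add(pattern)
    return candidates


def new_attribute_value(row, attrs, patterns):
    """Value of the new attribute for one datum (dict attribute -> value)."""
    observed = tuple(row[a] for a in attrs)
    return observed if observed in patterns else OTHER


rules = [{"A1": "x", "A2": "y"}, {"A2": "z", "A1": "x"}, {"A3": "q"}]
candidates = build_candidates(rules)
```

The first two rules share the attributes (A1, A2) and are therefore merged into a single candidate with two explicit patterns plus the `OTHER` value.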

Evaluation of the candidates: Fourth, the usefulness of the new attribute
candidates $AN_j \in \mathcal{AN}$ in the decision tree is evaluated. We use the information gain
criterion, which is used in decision tree algorithms such as ID3 [Quinlan]. The
information gain $Gain(AN_j)$ for each candidate of a new attribute $AN_j$
is defined as follows.

$$Gain(AN_j) = -\sum_{i=1}^{|C|} \frac{n(c_i)}{N} \log_2 \frac{n(c_i)}{N} - \sum_{i=1}^{|V|-1} sup_i \left\{ -conf_i \log_2 conf_i - (1.0 - conf_i) \log_2 \frac{1.0 - conf_i}{|C| - 1} \right\} - \left( 1.0 - \sum_{i=1}^{|V|-1} sup_i \right) \log_2 |C| \quad (5)$$

where

$N$ : number of data

$n(c_i)$ : number of training data whose class is $c_i$

$sup_i$, $conf_i$ : support and confidence values of the rules $R_i$ that are merged into $AN_j$

This information gain is an approximate estimate for the case in which $AN_j$ is used
for the split at the top node. Note that only indexes calculated by the apriori
algorithm are used in this evaluation, which makes it possible to evaluate the candidates
in a short time.
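
A minimal sketch of this approximation is given below. It assumes the gain is read as the class entropy, minus a support-weighted entropy estimate for each rule-derived attribute value (with the probability mass outside the predicted class spread evenly over the other classes), minus the upper bound $\log_2 |C|$ for the data not covered by any rule. The function name and argument layout are hypothetical.

```python
from math import log2


def _plogp(p):
    """p * log2(p), with the usual convention 0 * log2(0) = 0."""
    return p * log2(p) if p > 0 else 0.0


def approximate_gain(class_counts, rules):
    """Approximate gain of a candidate split at the top node.

    class_counts: number of training data per class, n(c_i)
    rules:        (sup_i, conf_i) pairs of the rules merged into the candidate
    """
    n = sum(class_counts)
    n_classes = len(class_counts)
    class_entropy = -sum(_plogp(c / n) for c in class_counts)

    covered = 0.0
    branch_entropy = 0.0
    for sup, conf in rules:
        rest = 1.0 - conf
        h = -_plogp(conf)
        if rest > 0 and n_classes > 1:
            # Remaining mass assumed uniform over the other |C| - 1 classes.
            h -= rest * log2(rest / (n_classes - 1))
        branch_entropy += sup * h
        covered += sup

    # Data matched by no rule fall into the extra value; bound its entropy by log2|C|.
    return class_entropy - branch_entropy - (1.0 - covered) * log2(n_classes)
```

For two balanced classes and one rule with support 0.5 and confidence 1.0, the estimate is 1.0 - 0 - 0.5 = 0.5 bits, so the candidate would be kept; with no rules at all the estimate is 0.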



Improving the description of data: Finally, if $Gain(AN_j)$ is greater
than 0, the candidate is added to the original data $\mathcal{D}$. By such pre-processing,
the description of the data for the decision tree is improved.

3 Experiments and Discussion

We demonstrate the effectiveness of our proposed method as a data pre-processing
method in the data mining domain through some experiments. The features of the
data sets used in the experiments are summarized in Table 1. They were taken
from the UCI ML repository [ML]. As the decision tree algorithm, we adopt C4.5, which is
one of the best-known decision tree algorithms [Quinlan]. We evaluate the
effectiveness of the proposed method from two points of view. One is the improvement
of the decision tree in size and prediction accuracy. The other is the run time when the
method is applied to large volumes of data. Both are important for
applicability to data mining.



3.1 Improvement of Decision Tree by Proposed Method 

To confirm the effect of the proposed method in improving decision trees, we
generate decision trees from two data sets. One is the original data, and the other is the
data pre-processed by the proposed method. We compare the size and the prediction
accuracy of the two decision trees. The results are summarized in Table 1.

By applying the proposed method to the data, the generated decision tree is
improved in size and prediction accuracy in many cases. However, the proposed
method sometimes cannot improve the decision tree. This may be because
there are too many attribute values and noisy data, so that the associations of
attributes were not extracted appropriately as association rules. More
research on this problem is needed as future work.



Table 1. The effectiveness of proposed method in improving decision tree

                                    decision tree (original data)    decision tree (pre-processed data)
data         # of data # of cls.    size  err. rate(%) # of attr.    size  err. rate(%) # of attr.
census-data      32562         2     221          16.9          8     221          16.9         31
car               1728         4   173.2           6.8          6    88.7           4.5         18
monk1              124         2      18          24.3          6       8           0.0         22
monk2              169         2      31          35.0          6      35          24.5         19
monk3              122         2      12           2.8          6      11           7.2         27
mushroom          8124         2    32.0           0.0         22    32.6           0.0         42
nursery          12690         5   508.3           2.9          8   571.8           3.0         18
tic-tac-toe        958         2   139.3          15.2          9    23.7           0.3         25
votes              435         2    15.4           3.7         16    13.8           3.9        186
average                            115.0          10.8        8.7   100.6           6.0       38.8



3.2 The Performance of Proposed Method on Large Volume Data

Next, we investigate the performance of the proposed method on large volumes of
data. To make the test data set, we use the Monk1 data set. A datum is picked
at random as the source of a new datum, and noise is added to its class
with 5% probability. This process is repeated as many times as the required number
of test data. In this experiment, we use a PC (OS: Linux, CPU: Pentium
133MHz, memory: 128MB). The experimental results are summarized in Table 2.
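
The data generation procedure can be sketched as follows. This is an assumption-laden illustration: "adding noise to the class" is interpreted as replacing the class with a different one, the last element of each datum is taken to be its class, and the function name, seed, and toy source data are hypothetical.

```python
import random


def make_noisy_dataset(source, size, classes, noise=0.05, seed=0):
    """Sample `size` data from `source`, flipping the class with probability `noise`."""
    rng = random.Random(seed)
    data = []
    for _ in range(size):
        *attrs, label = rng.choice(source)  # pick a source datum at random
        if rng.random() < noise:
            # Assumed noise model: switch to one of the other classes.
            label = rng.choice([c for c in classes if c != label])
        data.append((*attrs, label))
    return data


# Toy stand-in for the Monk1 data: two attributes followed by a binary class.
toy_source = [(1, 1, 0), (2, 1, 1)]
sample = make_noisy_dataset(toy_source, 1000, classes=[0, 1])
```

With a 5% noise rate, a learner that recovers the underlying concept exactly would still show an error rate of about 5% on data generated this way.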



Table 2. The run time for pre-processing in the large data

             decision tree          decision tree          pre-processing
             (original data)        (pre-processed data)   run time
# of data    size   err. rate(%)    size   err. rate(%)    (sec)
   10,000      90            5.0       9            5.0         4
   50,000      79            4.9      12            4.9        13
  100,000      79            5.0      12            5.0        22
  500,000      90            5.0      13            5.0       114


Even as the volume of data becomes larger, the run time increases at most
linearly. The calculation cost is much smaller than that of logic programming
based methods. Furthermore, the effectiveness in improving the decision tree does not
decrease even if the data size is large. These features of the proposed method are
advantages for a data mining method.

4 Conclusion 

We proposed a new data pre-processing method for decision trees using association
rules of attributes, and confirmed its effectiveness in improving decision
trees and its applicability to data mining.

Abstract. The need for sophisticated analysis of textual documents is
becoming more apparent as data is being placed on the Web and digital 
libraries are surfacing. This paper presents an algorithm for generating 
constrained association rules from textual documents. The user specifies 
a set of constraints, concepts and/or structured values. Our algorithm 
creates matrices and lists based on these prespecified constraints and 
uses them to generate large itemsets. Because these matrices are small 
and sparse, we are able to quickly generate higher order large itemsets. 
Further, since we maintain concept relationship information in a concept 
library, we can also generate rulesets involving concepts related to the 
initial set of constraints. 



1 Introduction 



Finding global patterns or generalizations can provide insight into subsets of a 
textual data set. However, because of the large number of patterns that typi- 
cally exist in a textual database, preconstraining the data space and searching 
for targeted patterns can at times be more fruitful. This paper presents an algo- 
rithm for constrained association rule discovery. The main contributions of our 
algorithm are as follows. First, our algorithm creates matrices and lists based on 
prespecified user constraints and uses them to generate large itemsets. Because 
these matrices are sparse, we are able to quickly generate higher order large 
itemsets. Further, it incorporates knowledge from both the structured compo- 
nents and the unstructured components of the textual data set. Finally, since 
we maintain concept relationship information in a concept library, we can also 
generate rulesets involving concepts related to the initial set of constraints. 

The remainder of this paper is organized as follows. Section 2 presents a 
motivating example from a business document collection. Section 3 describes 
our conceptual framework for generating rulesets from semi-structured data. 
Section 4 details our constrained association rule algorithm. Finally, Section 5 
presents experimental results and conclusions. 

2 Motivating Example

Magazine articles, research papers, and World Wide Web HTML pages are tra- 
ditionally considered semi-structured information. Each of these examples con- 
tains some clearly identifiable features, including author, date, publisher and/or 
WWW address. In this paper, we refer to these identifiable features as struc- 
tured attributes. Each document also includes blocks of text that are considered 
unstructured components of the document, e.g. abstract, headings, and para- 
graphs. We define a concept to be any meaningful word, phrase, acronym, or 
name that has been extracted from these components. 

The problem with text data is that limited insight about documents can be 
attained using only the structured document components. A digital library sys- 
tem only provides limited knowledge about the document collection as a whole. 
For example, suppose a user is interested in answering the following question:

During the past year, how have articles in a particular journal been bro- 
ken down in my research area? 

If the user is attempting to answer this question using a typical information 
retrieval system, he will enter the research area of interest ( organizational theory) 
and a journal name (Academy of Management). When the list of articles is 
returned, he will need to find groupings of articles from the previous year and 
semi-automatically categorize them. 

We propose an algorithm that not only answers this question, but also at- 
tempts to provide the user with some additional insight. We define a constraint 
to be any concept or structured value input by the user. For example, if the user
specifies the following constraints (1998, Academy of Management and organi- 
zational theory), then our algorithm might return the following ruleset: 

(A) Academy of Management, 1998 $\rightarrow$ organizational theory : 20%

(B) Academy of Management, organizational theory, 1998
        $\rightarrow$ population ecology : 20%
        $\rightarrow$ bureaucracy : 30%



Rule A involves the constraints specified by the user. This rule states that 20%
of the 1998 articles in the Academy of Management journal are organizational
theory articles. Rule B involves concepts related to the concept constraints
(e.g. organizational theory). According to Rule B, in 1998, organizational theory 
articles published in the Academy of Management journal were about population 
ecology 20% of the time and bureaucracy 30% of the time. We refer to these rules 
as constrained association rules since they are based on a set of prespecified 
constraints. In order for Rule B to be generated, we must maintain information 
about relationships that exist between the concepts, both the strength of the 
relationship and the type of relationship (e.g. synonym, broad-narrow, etc). By 
maintaining concept and structured value data, we can quickly extract patterns 
from a large document space. 

3 Definitions and Conceptual Model

3.1 Association Rules 

Formally, [1] defines an association rule in transactional databases to be an
expression of the form $X \rightarrow Y$, where $X$ and $Y$ are sets of items or itemsets.
The support of itemset $XY$ is the probability of joint occurrence of $X$ and $Y$,
$P(XY)$. A large itemset is one in which $P(XY)$ is above the minimum support,
min_sup. We will use large itemsets and strong sets interchangeably throughout
the paper. The confidence of $X \rightarrow Y$ is defined as the conditional probability of
$Y$ given $X$, $P(Y|X)$, where min_conf represents the minimum confidence.

The definition of association rules in a transaction database can be extended 
to a semi-structured domain. Specifically, each document can be viewed as a 
transaction and each structured value or concept as an item. Even though we 
can model semi-structured data as a set of transactions, we choose to reformu- 
late this model because text mining has different requirements than traditional 
transaction database mining. First, the number of distinct items is very large, 
and, therefore, interesting rules may have a very low support. Also, documents 
(transactions) are typically long and must be parsed offline to determine the 
meaningful set of concepts and structured values. Finally, online ontologies or 
dictionaries identify semantic relationships between concepts. This additional 
knowledge can be used to efficiently focus and add semantic knowledge to the 
final ruleset. Because of these distinctions, current structured value association 
rule algorithms that scan through all the items in all the transactions are not 
ideal for mining document data. Instead, it can be more advantageous to gener- 
ate rules based on a set of constraints. For preconstrained textual mining, this 
structure facilitates the rule discovery process. 

3.2 Conceptual Model 

In our previous work [5], we proposed a system architecture that attempts to 
provide an infrastructure robust enough to facilitate the discovery of rules from 
semi-structured data sets. In this architecture, concepts are stored in the concept 
library, while structured values are stored in the database. Figure 1 shows some 
example structured value data and a mapping to the documents containing each 
structured value. We choose to associate each document name with a document 
id since storing varying length document names with each structured value uses 
more disk space. The concept library consists of two logical components. One is 
a mapping between concepts and documents containing each concept. Figure 2 
shows examples of this. The other is a graph structure that maintains the rela- 
tionships each concept has to other concepts in the domain, as well as the weight 
of each relationship. Figure 3 shows a portion of the concept library, where each 
concept, $C_i$, is a node and each relationship a weighted edge. Single directional
edges point from broader to narrower concepts, while bidirectional arrows exist
between similar sibling concepts. For every pair of concepts $C_a$ and $C_b$, the
relationship weight $rw(C_a C_b)$ identifies the strength of the relationship. There are

Structured Values (SV)            SV IDs  # Docs  Document ID List
Joe Smith                         S1           8  1, 3, 6, 8, 12, 16, 18, 20
...
Administrative Science Quarterly  S5           6  1, 3, 8, 12, 15, 17
Academy of Management             S6           7  2, 4, 6, 9, 10, 16, 19
...
1998                              S8          10  1, 3, 4, 5, 6, 8, 10, 16, 18, 19

Fig. 1. Structured Value Document Mapping



different techniques for determining relationship weight. It can be manually
determined by assigning a weight based on a defined relationship in a dictionary or
ontology. Another approach uses information retrieval techniques to determine
the weight based on the frequency of appearance of $C_a$ and $C_b$ in the same
document. If $rw(C_a C_b)$ is greater than a minimum specified relationship weight, then
we say that $C_a$ and $C_b$ have a strong bond. The notion of 'bonds' is derived from
the bond-energy algorithm used in cluster analysis to derive similarities among
attributes [7]. We use the bond to represent the relationship between two items.
Bond energy or bond denotes the relationship weight. Document data can be



ConceptID  # Docs  Document ID List
C1             10  1, 3, 4, 6, 7, 8, 11, 16, 18, 20
C2              2  2, 5
C3              8  1, 3, 6, 8, 11, 16, 18, 20
C4              8  5, 7, 8, 10, 12, 15, 16, 18
C5              4  2, 9, 13, 19
C6              3  5, 6, 10
C7              4  2, 3, 8, 10
C8              6  5, 10, 12, 15, 17, 19

Fig. 2. Concept Document Mapping



stored as traditional transaction systems, where the document id corresponds 
to the transaction id and the concepts or structured values map to transaction 
items. Our model differs from this traditional approach. Instead of storing each 
document transaction in the database, we choose to associate transaction ids 
with each item. In other words, for a given item (concept or structured value), 
we maintain a mapping to transaction (document) ids in the concept library and 
the database. Specifically, the problem of discovering large itemsets reduces to 









Lisa Singh et al. 




Fig. 3. Concept Library Data 



the problem of traversing sparse matrices. 
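
A minimal sketch of this item-to-document mapping is shown below, using two of the document id lists from Fig. 1. The collection size of 3000 documents is an assumption borrowed from the worked example later in the paper, and the function names are hypothetical.

```python
# Item -> document ids, copied from Fig. 1.
doc_ids = {
    "S6": {2, 4, 6, 9, 10, 16, 19},            # Academy of Management
    "S8": {1, 3, 4, 5, 6, 8, 10, 16, 18, 19},  # 1998
}


def joint_documents(items, index):
    """Documents that contain every item in `items`."""
    return set.intersection(*(index[item] for item in items))


def support(items, index, n_docs):
    """P(items): fraction of the collection containing all the items."""
    return len(joint_documents(items, index)) / n_docs


shared = joint_documents(["S6", "S8"], doc_ids)
p = support(["S6", "S8"], doc_ids, n_docs=3000)  # n_docs is an assumed collection size
```

Because each item carries its own document list, the support of an itemset needs only the lists of its members, not a scan over every document.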

A small amount of literature exists about semi-structured text mining algo- 
rithms [2, 3, 6]. Of those, [6] is the only algorithm that uses prespecified con- 
straints in it. The other works propose unsupervised procedures. As previously 
mentioned, for semi-structured data sets, unconstrained mining is very slow and 
leads to large numbers of rulesets. 

Our previous work in [6] also identified associations among concepts and a
single structured value. The user specifies one concept $C_1$, one structured value
$S_1$, a minimum support min_sup, and a minimum confidence min_conf. The
algorithm begins by obtaining the document ids of each constraint. The support
for the potential large itemset, $P(C_1 S_1)$, is calculated. If it is above min_sup, we
search for large itemsets containing concept-relatives $C_r$ of constraint $C_1$. For
each concept $C_r$, if $rw(C_1 C_r)$ is larger than a minimum specified relationship
weight, $P(C_1 S_1 C_r)$ is calculated. Rules are generated for all $P(C_1 S_1 C_r)$ above
min_sup. Rules above min_conf are returned to the user.

To extend this algorithm to allow users to specify multiple constraints, it
is necessary to introduce new data structures so as to avoid the computation
of $P(C_i S_j C_r)$ for all combinations of concept constraints, $C_i$, structured value
constraints, $S_j$, and concept-relatives, $C_r$.



4 Constrained Association Rule Algorithm 

We have extended our previous algorithm in the following ways: 

1. We allow users to specify an unlimited number of constraints. 

2. We use two sparse matrices to help us avoid comparing document lists for 
every subset of structured data values and concepts. 

3. When determining large itemsets involving concept-relatives, we do not need
to calculate the probability of joint occurrence for every concept-relative and
the current large itemset. Instead we eliminate concept-relatives that clearly
cannot be a member of a new large itemset.

4. Our concept library stores not only parent-child-sibling relationships, but
others as well: synonym, part-to-whole, and is-a. This in turn implies that
a semantically more accurate set of rules can be generated. It should be
noted that our concept library is a compilation of multiple extended concept
hierarchies (ECHs) as proposed in [6].

During the development of our proposed algorithm, we will make use of the 
user inputs in Figure 4. Figure 5 highlights the main features of our algorithm. 



STRUCTURED VALUES                            CONCEPTS
AUTHOR:                                      strategy
    Joe Smith                                finance
PUBLICATION NAME:                            management
    Administrative Science Quarterly         profit
    Academy of Management
PUBLICATION DATE:
    1998

MINIMUM SUPPORT:               0.1%
MINIMUM CONFIDENCE:            60%
MINIMUM RELATIONSHIP WEIGHT:   30%

Fig. 4. User Inputs



The algorithm begins by obtaining the document ids of all the structured values
and all the concepts originally requested by the user, where $N_c$ and $N_s$ are the
number of concepts and structured values specified by the user, respectively.
The structured value information is obtained from the database, while the
concept information is obtained from the concept library. For our example query
illustrated in Figure 4, $N_c = 4$ and $N_s = 4$.

For all the specified constraints, $C_i$ and $S_j$, Step 2 verifies that they are large
1-itemsets by checking the count of the document lists. The concept constraints
above min_sup are placed in the C-List, a list containing all the large concept
itemsets. Similarly, the structured value constraints above min_sup, 0.1% for our
example, are placed in the SV-List, a list containing all the large structured value
itemsets.

In Step 3, we generate a structured value matrix, the SV-Bond Matrix, that will
be used to determine higher order large itemsets. In order to create the matrix,
we use the items in the SV-List to generate large 2-itemsets. We determine if
$P(S_a S_b)$ is above min_sup for all pairs (a, b) in the list. If so, the pair (a, b) is
added to the SV-List and a 1 is added to the SV-Bond Matrix to indicate a
strong bond between the two structured values. If the support of pair (a, b) is less
than min_sup, a 0 is added.

Notice that we only populate the lower triangle of the matrix. At this stage, 
the sum of each column is calculated. This pair count identifies the number of 








Algorithm: Generalized Constrained Association Rules

Input:  C_i - concept constraints (i: 1 to N_c)
        S_j - structured value constraints (j: 1 to N_s)
Output: RS - ruleset containing constraints and related concepts

1.  Get document ids for all C_i and S_j
2A. for every C_i
        if support(C_i) > min_sup, add to C-List
2B. for every S_j
        if support(S_j) > min_sup, add to SV-List
3.  Create structured value matrix, SV-Bond Matrix
4.  Determine structured value large itemsets, put in SV-List
5.  Create concept matrix, C-Bond Matrix
6.  Determine concept large itemsets, put in C-List
7.  Identify large itemsets using lists
8.  for each large itemset, determine concept-relatives
        generate additional large itemsets using concept-relatives
9.  Calculate RS



Fig. 5. Pseudo-code for Constrained Association Rule Alg. 



strong bonds existing in the column. Figure 6 shows the SV-Bond Matrix for
our example. Column $S_1$ has the largest pair count. Just based on these counts
alone, we know that the maximum large itemset size cannot exceed 3 and that
it must include column $S_1$.
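
The construction of the bond matrix and its pair counts can be sketched from the document lists in Fig. 1. Two assumptions are made here: the collection holds 3000 documents, so that a min_sup of 0.1% corresponds to 3 documents, and a pair whose support equals the threshold counts as strong, which is what the worked example in the text requires. The matrix is stored as a dictionary over ordered pairs, and the names are hypothetical.

```python
from itertools import combinations

sv_docs = {
    "S1": {1, 3, 6, 8, 12, 16, 18, 20},
    "S5": {1, 3, 8, 12, 15, 17},
    "S6": {2, 4, 6, 9, 10, 16, 19},
    "S8": {1, 3, 4, 5, 6, 8, 10, 16, 18, 19},
}
MIN_COUNT = 3  # assumed: 0.1% of a 3000-document collection


def bond_matrix(docs, min_count):
    """Lower triangle of the bond matrix: 1 for a strong pair, else 0."""
    return {
        (a, b): int(len(docs[a] & docs[b]) >= min_count)
        for a, b in combinations(docs, 2)  # a always precedes b
    }


def pair_counts(docs, bonds):
    """Column sums: number of strong bonds each item has with later items."""
    counts = {name: 0 for name in docs}
    for (a, _b), bit in bonds.items():
        counts[a] += bit
    return counts


bonds = bond_matrix(sv_docs, MIN_COUNT)
counts = pair_counts(sv_docs, bonds)
```

Under these assumptions the column for S1 has a pair count of 2 and no column reaches 3, which agrees with the observations that the largest itemset has at most three items and must include S1.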

Fig. 6. Example SV-Bond Matrix



Fig. 7. Example C-Bond Matrix 



The matrix can now be used to generate higher order large itemsets (Step
4 of algorithm) without first determining the intersections of every subset of
structured values. Since all subsets of a large itemset must have strong bonds,
we can use the information in the SV-Bond Matrix to find these larger sets.
There are three criteria for determining the large itemsets of size N, where N is
larger than 2:

1. The pair count for one of the N columns containing the data values is at
least N - 1.

2. Each pair in the set must have a strong bond. 

3. The set must have support above the min_sup. 

We determine large 3-itemsets by using the strong bonds in the SV-Bond
Matrix. For each set of size 2, we check the pair count for the first item in the
set. If it is at least 2, we test whether the set can be expanded by adding a structured
value that has a strong bond to all of the elements in the current set. This
new 3-itemset is a potential large itemset. Once all the potential large itemsets
of size 3 have been found, the joint probability of each potential large itemset is
calculated. Those sets with a joint probability above min_sup are added to the
SV-List. This continues for increasing N until either no new set is found or the pair
count for all the columns in the matrix is less than N - 1. Usually, the SV-Bond Matrix
is not dense. Therefore, much work is saved by not testing every combination of the
input set.

Based on the SV-Bond Matrix in Figure 6, $(S_1 S_5 S_8)$ is the only potential
large 3-itemset. Notice that it is unnecessary to check for any potential large
3-itemsets beginning with $S_5$ since the pair count for the column is 1. If we
assume the document collection contains 3000 documents, then $P(S_1 S_5 S_8) =
3 / 3000 = 0.1\%$ and $(S_1 S_5 S_8)$ is a large itemset. Because no column has a pair
count of 3, we know that there are no large 4-itemsets.
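
The three criteria can be checked against the Fig. 1 document lists with a short sketch. It is a simplified filter over combinations rather than the incremental expansion described above, it assumes a 3000-document collection (so min_sup corresponds to 3 documents), and it treats support equal to the threshold as large, as the worked example does. The function name is hypothetical.

```python
from itertools import combinations

sv_docs = {
    "S1": {1, 3, 6, 8, 12, 16, 18, 20},
    "S5": {1, 3, 8, 12, 15, 17},
    "S6": {2, 4, 6, 9, 10, 16, 19},
    "S8": {1, 3, 4, 5, 6, 8, 10, 16, 18, 19},
}
MIN_COUNT = 3  # assumed: 0.1% of a 3000-document collection


def large_itemsets(docs, size, min_count):
    """Itemsets of the given size (> 2) that satisfy the three criteria."""
    names = list(docs)
    strong = {
        frozenset(pair)
        for pair in combinations(names, 2)
        if len(docs[pair[0]] & docs[pair[1]]) >= min_count
    }
    pair_count = {
        a: sum(frozenset((a, b)) in strong for b in names[i + 1:])
        for i, a in enumerate(names)
    }
    found = []
    for combo in combinations(names, size):
        if pair_count[combo[0]] < size - 1:
            continue  # criterion 1: first column needs at least N - 1 bonds
        if not all(frozenset(p) in strong for p in combinations(combo, 2)):
            continue  # criterion 2: every pair must have a strong bond
        shared = set.intersection(*(docs[s] for s in combo))
        if len(shared) >= min_count:  # criterion 3: verify the support
            found.append((combo, shared))
    return found


three = large_itemsets(sv_docs, 3, MIN_COUNT)
four = large_itemsets(sv_docs, 4, MIN_COUNT)
```

Only the first two criteria are checked from the bond information; the document lists are intersected just for the candidates that survive them.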

A similar matrix called the C-Bond Matrix is used to identify large itemsets
involving concepts (Step 5 of algorithm). However, the process is less costly
because intersections of document id lists are avoided when creating the matrix.
This is possible because we have relationship information in the concept library.

In order to create the matrix, we determine if the relationship weight between 
every pair of concepts in the C-List is above the minimum specified relationship 
weight. If so, a 1 is added to the C-Bond Matrix. Figure 7 shows the C-Bond 
Matrix generated from our example data assuming the minimum relationship 
weight is 0.30. 

Once the C-Bond Matrix is generated, Step 6 finds higher order large itemsets
using the same approach as that presented in Step 4. One difference is that the
potential large itemsets do not need to be verified. Each potential large itemset
is an actual large itemset. Recall that in Step 4 the supports of all the potential
large itemsets needed to be verified at every level by intersecting the document
lists.

This is avoided because relationship weight overrides support. It is a more
general weight that holds across individual databases within a particular domain.
Therefore, if every pair of concepts in a set has a strong relationship, the set
is a large itemset. Because we will be generating mixed rulesets that contain
both structured values and concepts, we perform a union of the document lists
associated with the large concept itemset. Therefore, the support for the large
concept itemset is 1. In this manner, we ensure that the concept component of
any mixed ruleset is large in and of itself. All concept large itemsets are added
to the C-List.

Based on the C-Bond Matrix in Figure 7, $(C_1 C_3 C_4)$ is the only large
3-itemset. It should be noted that in order to have a large N-itemset containing
concepts, all the concepts in the large itemset must form a fully connected
subgraph. Returning to Figure 3, we see that $C_1$, $C_3$ and $C_4$ form a fully connected
subgraph.
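
The fully connected subgraph test can be sketched as below. The relationship weights are invented purely for illustration, since the values behind Fig. 3 and Fig. 7 are not given in the text; they are chosen so that C1, C3 and C4 are pairwise strongly related at the 0.30 threshold. The function names are hypothetical.

```python
from itertools import combinations

# Hypothetical relationship weights (not taken from the paper).
rw = {
    ("C1", "C3"): 0.45,
    ("C1", "C4"): 0.40,
    ("C3", "C4"): 0.35,
    ("C1", "C2"): 0.10,
    ("C2", "C3"): 0.20,
    ("C2", "C4"): 0.05,
}
MIN_RW = 0.30  # minimum relationship weight from the example inputs


def bonded(a, b):
    """True if the relationship weight of the pair exceeds the threshold."""
    return rw.get((a, b), rw.get((b, a), 0.0)) > MIN_RW


def fully_connected(concepts, size):
    """Concept sets of the given size in which every pair has a strong bond."""
    return [
        combo
        for combo in combinations(concepts, size)
        if all(bonded(a, b) for a, b in combinations(combo, 2))
    ]


triples = fully_connected(["C1", "C2", "C3", "C4"], 3)
```

No document lists are touched here: a concept set is accepted on relationship weights alone, which is the cost saving over the structured value case.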

At this stage in the algorithm, we have two lists, one consisting of structured 
value strong itemsets and the other of concept strong itemsets. These lists are 
used to generate mixed large itemsets. We calculate the support by intersecting 
the document lists of the structured value large itemset and the concept large 
itemset and keep the potential large itemset if it is above min_sup. This large 
itemset can now be used in rules. For example, the following large itemsets can 
be generated: CzCiS\S^S%. 

To generate large itemsets including related concepts or themes not prespec- 
ified by the user, concept-relatives with strong bonds must be determined (Step 
8 of algorithm). The purpose of generating these itemsets is to facilitate the 
discovery of new concept sets that might otherwise have been overlooked by the 
user. We generate sets involving concept-relatives by taking the relatives of one 
concept in a large mixed set and checking whether or not all the other concepts 
in the large itemset are related to the concept-relative. If all the concepts in the 
large itemset are related to the concept-relative, a new large itemset composed 
of the current large itemset and the relative is created. This step is repeated for 
each mixed large itemset. 
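
The concept-relative check in Step 8 can be sketched as below. The function name `extend_with_relatives`, the dictionary of relatives, and the concept names are hypothetical and serve only to illustrate the test "related to every concept in the itemset".

```python
def extend_with_relatives(concept_items, relatives):
    """Return relatives that can be added to a large itemset.

    `concept_items` lists the concepts of one large mixed itemset and
    `relatives` maps a concept to the set of concepts it is strongly
    bonded to. A relative of the first concept qualifies only if it is
    also a relative of every other concept in the itemset.
    """
    members = set(concept_items)
    first = concept_items[0]
    found = []
    for r in sorted(relatives.get(first, set()) - members):
        if all(r in relatives.get(c, set()) for c in concept_items):
            found.append(r)
    return found


# Hypothetical concept library entries.
relatives = {
    "strategy": {"finance", "budgets", "planning"},
    "finance": {"strategy", "budgets"},
}
new_concepts = extend_with_relatives(["strategy", "finance"], relatives)
```

Here "planning" is a relative of "strategy" only, so it is rejected, while "budgets" is related to both concepts and would yield a new large itemset.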

For our example, suppose we are attempting to find the concept-relatives 
of the large itemset C3C4S5S8. Concepts C6 and C7 are concept-relatives of 
concept C3, but only C7 is also a relative of C4. Therefore, the large itemset 
C3C4C7S5S8 is created. 

The final step involves generating the actual rules. Similar to other works, 
we specify rules by finding antecedents and consequents above min_conf. An 
additional step we include involves checking the relationship type information 
to attempt to identify a rule with a meaningful semantic relationship. One of the 
rules generated from the large itemset is the following: 67%: C3, C4, S5, S8 → C7. 
Strategy and finance related articles written in Administrative Science Quarterly 
in 1998 focus on budgets 67% of the time. 

5 Results 

5.1 Data Set 

Our data set consists of over 50,000 documents from the ABI/Inform Information 
Retrieval System. ABI/Inform maintains bibliographic information, abstracts, 
and documents from over 800 business related journals. We only use a small 
subset of the documents in the ABI/Inform IR System. The structured data 
values associated with each article are stored in an Oracle 7 relational database 
on an HP 700 series workstation over an NFS file system. Examples of structured 
value tables include author, publication, and location. 

5.2 Experiments 

During our experiments, neither the C-Bond Matrix nor the SV-Bond Matrix was 
very full: on average the SV-Bond Matrix was 7% full and the C-Bond Matrix 
10% full. We were interested in determining if a bound existed on the average 
number of relationships a concept has. Therefore, we sampled concepts in 
WordNet, a general online ontology [4], and found that concepts had an average 
of 25 relationships. This implies that most columns in the C-Bond Matrix are sparse. 

Likewise, the sparse nature of the SV-Bond Matrix is also a result of a textual 
data set. For example, multiple publication values cannot ever occur in the same 
document. Further, a typical author publishes with a small number of co-authors 
(150 in a lifetime) in a small number of journals (100 or so). Since a user will 
typically enter a mix of all different types of structured values, a relationship 
bound will exist. 

Our experimental results for execution time of our algorithm averaged over 
10 runs are shown in Figures 8, 9, 10. Figures 8 and 9 show that the overall time 
is quasi-linear with respect to the number of concepts entered and the number of 
structured values entered, respectively. For Figure 8, the number of structured 
values was held constant at 5. Similarly, in Figure 9, the number of concepts was 
also held constant at 5. Figure 10 shows the total running time with respect to 
the total number of 





Fig. 8. Time vs. Nbr. of concepts 

Fig. 9. Time vs. Nbr. of structured values 



5.3 Conclusions 

Fig. 10. Total running time 

We have introduced an algorithm for generating constrained association rules. It 
incorporates knowledge from both the structured and unstructured components 
of the data set and generates rulesets using concepts related to the original set 
of constraints. It also uses some simple data structures that reduce the overall 
complexity of the algorithm. 

This algorithm has only been tested on one semi-structured data set. We hope 
to continue our testing on other semi-structured data sets. At this stage we have 
not compared this algorithm against those proposed for structured transaction 
databases. We need to determine the sparsity of structured data sets to evalu- 
ate whether or not our approach is a good option in that domain. Finally, the 
rulesets presented here are fully constrained rulesets. We are also interested in 
partially constraining the document space and evaluating performance under 
those circumstances. 

Abstract. Semistructured data is characterized by the lack of any fixed and rigid 
schema, even though some implicit structure typically appears in the data. The 
huge number of on-line applications makes it imperative to mine the schema of 
semistructured data, both for users (e.g., to gather useful information and 
facilitate querying) and for systems (e.g., to optimize access). The critical 
problem is to discover the implicit structure in the semistructured data. Current 
methods for extracting Web data structure are either general and independent of 
the application background [8], [9], or bound to some concrete environment such 
as HTML [13], [14], [15]. Both, however, are expensive and have difficulty 
keeping up with the frequent and complicated changes in Web data. In this paper, 
we deal for the first time with the problem of incremental mining of schema for 
semistructured data after the raw data has been updated. An algorithm for 
incrementally mining the schema of semistructured data is provided, and some 
experimental results are given, which show that our incremental mining for 
semistructured data is more efficient than non-incremental mining. 

Keywords: Data Mining, Incremental Mining, Semistructured Data, Schema, 
Algorithm. 



1 Introduction 

As more and more applications prove successful, data mining in very large 
databases has recently received great attention from many database research 
communities because of its promise in many areas, such as OLAP in insurance 
safety, decision support, market strategy and financial forecasting [1], [2]. In the 
early years, much research effort was targeted at the mining of transaction 
databases, relational databases, spatial databases, etc., which generally hold 
structured data [3], [4], [5], [6], [7]. However, as the Internet develops rapidly, 
the volume of data available on-line is growing rapidly too. Generally, data on the 
network or in a digital library has no absolute schema fixed in advance, and the 
structure of the data may be fairly irregular or incomplete. This kind of data is 
termed semistructured data. Semistructured data is self-describing data that has 
some structure but which is neither regular nor known a priori to the system. In 
order to facilitate querying and to optimize access to the Web, mining a common 
structure or schema from the semistructured data is particularly important. 

The large amount of interesting data on the Web encourages people to try to 
solve the problem above. Several effective works such as [8], [9] extract the 
schema of semistructured data in a general way that is independent of the 
application environment, while [13], [14], [15] reach the same goal through 
different paths. S. Nestorov's work in [8] approximately classifies semistructured 
objects into a hierarchical collection of types to extract the implicit structure in 
large sets of semistructured data. K. Wang in [9] was the first to directly raise the 
problem of discovering schema for semistructured data, and proposed a 
framework and a corresponding algorithm. But the approaches in [8], [9] both 
suffer from redundant computation, whose cost is high when Web data changes 
so often. Mendelzon's group at the University of Toronto started from a specific 
application field such as HTML and, with the help of cost locality and range 
expression bounds, obtained the Web data structure through a calculus based on 
virtual graphs [13], [14], [15]. Their idea is easy to implement but is limited to the 
application field to which it was first applied. As the data on the Internet/Web 
increases continuously over time, mining data on the Internet/Web cannot be 
done once and for all. The mined results need to be recast as the raw data grows. 
A useful way to update the mined results is to take advantage of the already 
discovered results rather than redo the whole work from scratch. So, naturally, an 
incremental mining philosophy must be adopted. However, to the best of our 
knowledge, there is no literature on incremental mining of schema for 
semistructured data to date. 

This paper solves the problem of incremental mining of schema for 
semistructured data and presents an algorithm for this purpose. The rest of this 
paper is organized as follows. We first introduce some basic concepts in Section 2, 
including the object exchange model (OEM) used for representing semistructured 
data, schema mining for semistructured data, and the incremental mining 
problem. In Section 3, the algorithm for incrementally mining semistructured 
data is given in detail. Section 4 provides some experimental results that show the 
advantage of the incremental mining method. Finally, we summarize the paper in 
Section 5. 



2 Defining the Problem 



In this section, we assume that one description model, namely OEM, is applied, 
and we introduce the corresponding definitions of schema and incremental 
mining on semistructured data. 

2.1 Object Exchange Model (OEM) 

We represent semistructured data by the object exchange model (OEM), which 
was proposed for modeling semistructured data in [10]. In OEM, semistructured 
data is represented as a rooted, labeled, directed graph with the objects as vertices 
and labels on edges. Every object in OEM consists of an identifier and a set of 
outgoing edges with labels. The identifier uniquely identifies the object; the labels 
connect the object to its subobjects and describe the meaning of the relationship 
between the object and the subobjects. OEM can be treated as a graph where the 
nodes are objects and the labels are on the edges (object references). If the graph 
is acyclic, that is to say, a tree, then the OEM is acyclic. In this paper, we assume 
that the OEM is acyclic unless stated otherwise. Fig. 1 illustrates an example of 
semistructured data modeled in OEM. 




Fig. 1. Example semistructured database TD in OEM 



2.2 Schema Mining for Semistructured Data 

For convenience in discussing schema mining from semistructured data, we 
define a transaction database as a collection of semistructured data objects, which 
are called transactions; i.e., each transaction represents an object which consists 
of a series of subobjects and labels. 

Mining a schema from semistructured data means finding a typical structure 
which matches most of the targeted objects in the transaction database. In the 
OEM environment, a schema of a data object can be represented by a tree, which 
is called a schema tree or st. In [9] it is described as a tree expression. A schema 
tree is made of one or more tree branches. Each tree branch is defined as a path 
from the root object node to any subobject node in the tree. We say two branches 
are distinct if they are not contained or covered by each other. A schema tree with 
only one branch is called a size-1 schema tree, and a size-k schema tree consists 
of k distinct tree branches. 

Suppose a transaction database exists and st is a schema tree of some 
transactions. The support of st is the number of transactions t such that st is a 
schema tree of t, divided by the total number of transactions in the database. 
Given the user-specified minimum support MINSUP, st is frequent if the support 
of st is not less than MINSUP, and st is maximally frequent if it is frequent and is 
not covered or contained by any other frequent tree. St is a schema, or simply a 
pattern, if st is maximally frequent. To mine schema is to find all patterns in the 
database. 

Example 1: In Fig. 1, choose “person” as the targeted object; there are 3 such 
objects in total. Let MINSUP=2%; then st1={name}, st2={address}, st3={age}, 
st4={name, address}, st5={name, age}, st6={address, age}, and st7={name, 
address, age} are all frequent schema trees of object “person”, and st7 is the only 
maximally frequent schema tree. st1, st2, st3 are size-1 trees, st4, st5 and st6 are 
size-2 trees, and st7 is a size-3 tree. 

For a database (a collection of semistructured data), an Apriori-like method is 
used to mine the schema of the database. The process of mining all schemata in 
the database is as follows. 

1. Discover all frequent size-1 schema trees; 

2. Find all frequent size-k schema trees (k>1) repeatedly until no more 
frequent schema trees of any size can be found. Similar to the mining of 
association rules [11], before finding size-k schema trees, a candidate set of 
size-k schema trees is created on the basis of the size-(k-1) schema trees 
found; 

3. Get rid of all weak schema trees which are covered or contained by 
other schema trees, to obtain all maximally frequent schema trees. 

One or more consecutively connected labels may be included in a size-1 schema 
tree. Finding the size-1 schema trees is the most time-consuming phase because 
many combinations of different labels, and of different numbers of labels, must be 
considered. The key to mining schema efficiently is to construct a minimal 
candidate set of schema trees. 
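
The three-phase process can be sketched in a deliberately simplified setting. The code below is an illustration under our own assumptions, not the algorithm of [9]: every branch is reduced to a single label, so a transaction is a set of labels and a schema tree is a set of branches, and the names `mine_schemas` and `min_count` (a minimum support expressed as a transaction count) are hypothetical.

```python
from itertools import combinations


def mine_schemas(transactions, min_count):
    """Level-wise mining of frequent and maximally frequent branch sets.

    `transactions` is a list of sets of branch labels; a schema tree is a
    frozenset of branches and is frequent when at least `min_count`
    transactions contain all of its branches.
    """
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Phase 1: frequent size-1 schema trees.
    singles = {frozenset([b]) for t in transactions for b in t}
    level = {c: s for c, s in count(singles).items() if s >= min_count}
    frequent = dict(level)

    # Phase 2: build size-k candidates from the size-(k-1) frequent trees.
    k = 2
    while level:
        prev = list(level)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c: s for c, s in count(candidates).items() if s >= min_count}
        frequent.update(level)
        k += 1

    # Phase 3: drop trees contained in another frequent tree.
    maximal = [x for x in frequent if not any(x < y for y in frequent)]
    return frequent, maximal


# Three hypothetical "person" objects (not the data of Fig. 1).
persons = [
    {"name", "address", "age"},
    {"name", "address"},
    {"name", "age"},
]
frequent, maximal = mine_schemas(persons, 2)
```

With a minimum count of 2, {address, age} occurs only once, so the size-3 candidate is pruned before any counting and the patterns are {name, address} and {name, age}.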



2.3 Incremental Mining Problem 

In a network environment, especially on the Web, all kinds of data are growing 
continuously and rapidly. How to update the already mined results is an 
unavoidable problem challenging the data mining community. One possible 
approach to the update problem is to re-run the schema-mining algorithm on the 
whole updated data from scratch. However, this approach is not efficient because 
it does not utilize the already mined results. In contrast, the incremental mining 
approach makes use of the mined results and, at the same time, focuses on mining 
the newly added data. In this paper, we adopt a method similar to that in [12], 
which was designed for incremental maintenance of association rules, to mine the 
schema of semistructured data efficiently. 

Let TDB be a transaction database on which data mining is performed, TE be 
the set of maximally frequent schema trees in TDB, and D be the number of 
transactions in TDB. After some update activities, an increment tdb of new 
transactions is added to the original database TDB. The size of tdb is d. The 
essence of the incremental mining problem is to find TE', that is, the set of 
maximally frequent schema trees in TDB ∪ tdb. 

The following notation is used in the remainder of this paper. F_k is the set of 
all size-k frequent schema trees in TDB, and F_k' is the set of all size-k frequent 
schema trees in TDB ∪ tdb. C_k is the set of size-k candidate schema trees. 
Moreover, X.support_D, X.support_d, and X.support_UD represent the support 
counts of a schema tree X in TDB, tdb and TDB ∪ tdb, respectively. 



3 The Incremental Mining Algorithm for Semistructured Data 

After the database update activities have finished, two consequences may be 
triggered. Some frequent schema trees in TDB may no longer be frequent in 
TDB ∪ tdb because of the increase in the total number of transactions. These 
schema trees are called losers. On the other hand, some schema trees which are 
not frequent in TDB may become frequent in TDB ∪ tdb. These schema trees are 
called winners. 

Consequently, the incremental mining process is composed of two major phases: 

1. Getting rid of the losers from TE by scanning the increment tdb; 

2. Discovering the winners in TDB ∪ tdb by checking TDB. 

We present three lemmas which are useful in improving the efficiency of mining 
the frequent schema trees for the updated database. They follow easily from the 
definition of frequent schema trees, so we omit the proofs here. 

Lemma 1. If a size-(k-1) schema tree is a loser at the (k-1)-th iteration, then a 
frequent size-k schema tree in F_k containing this schema tree cannot be a winner 
in the k-th iteration. 

Lemma 2. A size-k schema tree X in the original size-k frequent schema tree set 
F_k is a loser in the updated database TDB ∪ tdb if and only if 
X.support_UD < st * (D+d). 

Lemma 3. A size-k schema tree X not in the original size-k frequent schema tree 
set F_k can become a winner in the updated database TDB ∪ tdb if and only if 
X.support_UD ≥ st * (D+d). 
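
The lemmas rest on the fact that support counts add across the original database and the increment. The short derivation below is our own reading of the definitions, not a proof taken from the paper; it writes the support counts of X in TDB, tdb and their union as X.support_D, X.support_d and X.support_UD, and shows why a candidate that is infrequent within tdb alone can be discarded without scanning TDB.

```latex
\begin{align*}
X.support_{UD} &= X.support_{D} + X.support_{d}, \\
X \notin F_k &\;\Rightarrow\; X.support_{D} < st \cdot D, \\
X \notin F_k \ \text{and}\ X.support_{d} < st \cdot d
  &\;\Rightarrow\; X.support_{UD} < st \cdot D + st \cdot d = st \cdot (D + d).
\end{align*}
```

So a schema tree that was not frequent in TDB and is not frequent in tdb cannot be a winner, which is the pruning condition applied to the candidate sets in each iteration.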

A detailed description of the incremental mining algorithm is as follows. 



3.1 The First Step: To Discover the Size-1 Schema Tree Set F_1' in TDB ∪ tdb 

Finding the frequent size-1 schema tree set F_1' in TDB ∪ tdb is divided into 4 
steps, outlined as follows. 

1. For all schema trees X ∈ F_1, scan the incremental data tdb and update the 
support count X.support_UD. Once the scan is completed, all the losers in F_1 
are found by checking the condition X.support_UD < st * (D+d) on all X ∈ F_1 
(based on Lemma 2). After pruning the losers, the schema trees in F_1 which 
remain frequent after the update are identified. 

2. In the same scan as step 1, for each T ∈ tdb, all size-1 schema trees X ⊆ T 
which are not in F_1 are stored in a set C_1. This becomes the set of candidate 
schema trees, and their support in tdb can also be found in the scan. 
Furthermore, according to Lemma 3, if X ∈ C_1 and X.support_d < st * d, X can 
never be frequent in TDB ∪ tdb. Because of this, all the schema trees in C_1 
whose support counts are less than st * d are pruned off. In this way, a 
very small candidate set for finding the new size-1 frequent schema trees is 
obtained. 

3. One scan is then carried out on TDB to update the support count 
X.support_UD for each X ∈ C_1. By checking their support counts, new frequent 
schema trees from C_1 are found. 

4. The set of all size-1 frequent schema trees, F_1', is generated by combining the 
results from steps 1 and 3. 
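
The four steps above can be sketched as follows, again in a simplified setting of our own: each size-1 schema tree is a single label and each transaction is a set of labels. The function name `update_size1` and the sample data are hypothetical; `st` is the minimum support as a fraction.

```python
def update_size1(f1_counts, tdb, TDB, st):
    """Incrementally compute the frequent size-1 trees of TDB ∪ tdb.

    `f1_counts` maps each old frequent size-1 tree to its support count
    in TDB. Returns a dict mapping each new frequent size-1 tree to its
    support count in the updated database.
    """
    D, d = len(TDB), len(tdb)

    # Single scan of the increment: counts for old trees and candidates.
    inc = {}
    for t in tdb:
        for x in t:
            inc[x] = inc.get(x, 0) + 1

    # Step 1: keep old frequent trees that are not losers (Lemma 2).
    new_f1 = {}
    for x, old in f1_counts.items():
        total = old + inc.get(x, 0)
        if total >= st * (D + d):
            new_f1[x] = total

    # Step 2: candidates are new trees that are frequent within tdb.
    cand = {x: c for x, c in inc.items() if x not in f1_counts and c >= st * d}

    # Step 3: one scan of TDB, counting only the surviving candidates.
    for t in TDB:
        for x in t:
            if x in cand:
                cand[x] += 1

    # Step 4: add the winners to the trees kept in step 1.
    for x, total in cand.items():
        if total >= st * (D + d):
            new_f1[x] = total
    return new_f1


# Hypothetical data: "name" and "age" were frequent in TDB at st = 0.5.
TDB = [{"name", "age"}, {"name"}, {"name", "address"}, {"name", "age"}]
tdb = [{"address"}, {"address", "name"}]
new_f1 = update_size1({"name": 4, "age": 2}, tdb, TDB, 0.5)
```

In this example "age" becomes a loser (2 of 6 transactions) and "address" becomes a winner (3 of 6), while the original database is scanned only for the one surviving candidate.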



3.2 The Second Step and Beyond: To Discover the Size-k (k>1) Schema Tree 
Set F_k' in TDB ∪ tdb 

We summarize the process of finding the frequent size-2 schema trees F_2' in 
TDB ∪ tdb as follows; it is similar to finding F_1'. 

Like the first iteration, losers in F_2 are pruned out in a scan on tdb. The pruning 
is done in two steps. First, according to Lemma 1, some losers in F_2 can be 
filtered out without validating them against tdb. The set of losers F_1 - F_1' has 
been identified in the first iteration. Therefore, any schema tree X ∈ F_2 which 
has a sub-schema tree Y such that Y ∈ F_1 - F_1' cannot be frequent and is 
filtered out from F_2 without checking against tdb. Second, a scan is done on 
tdb, the support counts of the remaining schema trees in F_2 are updated, and 
the frequent schema trees from F_2 are identified. 

As in the first iteration, the second part of this iteration is to find the new size-2 
frequent schema trees. The key point is to construct a small set of candidate 
schema trees. The candidate set C_2 is generated, before the above scan on tdb 
starts, by applying Theorem 1 in [9] on F_1'. The schema trees in F_2 are not 
considered when creating C_2 because they have already been treated in step 1. 
The support counts of the schema trees in C_2 are added up in the same scan of 
tdb. The schema trees in C_2 can then be pruned by checking their support 
counts. For all X ∈ C_2, if X.support_d < st * d, X is eliminated from C_2. Based 
on Lemma 3, none of the eliminated schema trees can be frequent in TDB ∪ tdb. 

The third step is to scan TDB to update the support counts for all the schema 
trees in C_2. At the end of the scan, all the schema trees X ∈ C_2 whose support 
count X.support_UD ≥ st * (D+d) are identified as the new frequent schema trees. 

The set F_2', which contains all the frequent schema trees identified from F_2 
and C_2 above, is the set of all size-2 frequent schema trees. 

The same process as for size-2 schema trees is applied in the iterations k>2 until 
no frequent schema trees can be found. Then all frequent schema tree sets 
{F_1', F_2', ..., F_k'} are obtained. 

3.3 The Final Step: To Discover the Maximally Frequent Schema Trees 

From all frequent schema tree sets {F_1', F_2', ..., F_k'}, by eliminating those 
schema trees which are contained or covered by other frequent schema trees, the 
remaining frequent schema trees make up the maximally frequent schema tree 
set TE'. 



3.4 The Description of Incremental Algorithm 

Finally, the incremental mining algorithm is presented as follows. 

Algorithm 1: An incremental schema mining algorithm for semistructured data. 
Input: (1) Original and incremental databases: TDB and tdb; 

(2) The already mined schema tree sets from TDB: {F_1, F_2, ..., F_k}; 

(3) The minimum support threshold: st. 

Output: The maximally frequent schema tree set mined from TDB ∪ tdb: TE'. 

Method: 

The 1-st iteration: 

W = F_1; C = ∅; F_1' = ∅;            /* W: winners, C: candidate set */ 
for all T ∈ tdb do                   /* scan tdb */ 
    for all size-1 schema trees X ⊆ T do { 
        if X ∈ W then X.support_d++; 
        else { 
            if X ∉ C then { C = C ∪ {X}; X.support_d = 0; }   /* initialize count, add X into C */ 
            X.support_d++; 
        } 
    }; 
for all X ∈ W do                     /* put winners into F_1' */ 
    if X.support_UD ≥ st * (D+d) then F_1' = F_1' ∪ {X}; 
for all X ∈ C do                     /* prune candidates in C */ 
    if X.support_d < st * d then C = C - {X}; 
for all T ∈ TDB do                   /* scan TDB */ 
    for all size-1 schema trees X ⊆ T do 
        if X ∈ C then X.support_D++; 
for all X ∈ C do                     /* put winners into F_1' */ 
    if X.support_UD ≥ st * (D+d) then F_1' = F_1' ∪ {X}; 
return F_1'.                         /* end of the first iteration */ 

The k-th iteration (k>1): 

/* This program segment is repeated till no frequent schema trees can be found */ 
k = 2; 
loop: 
    W = F_k; F_k' = ∅;               /* W: winners */ 
    /* create the size-k candidates; schema_tree_candidate_gen() is 
       implemented according to Theorem 1 in [9] */ 
    C = schema_tree_candidate_gen(F_(k-1)') - F_k; 
    for all size-k schema trees X ∈ W do          /* prune the losers from W */ 
        for all size-(k-1) schema trees Y ∈ F_(k-1) - F_(k-1)' do 
            if Y ⊆ X then { W = W - {X}; break; } 
    for all T ∈ tdb do {             /* scan tdb and calculate support counts */ 
        for all X ∈ W such that X is contained in T do X.support_d++; 
        for all X ∈ C such that X is contained in T do X.support_d++; 
    } 
    for all X ∈ W do                 /* add the winners into F_k' */ 
        if X.support_UD ≥ st * (D+d) then F_k' = F_k' ∪ {X}; 
    for all X ∈ C do 
        if X.support_d < st * d then C = C - {X};     /* prune the candidate set */ 
    for all T ∈ TDB do 
        for all X ∈ C such that X is contained in T do X.support_D++; 
    for all X ∈ C do                 /* add the winners into F_k' */ 
        if X.support_UD ≥ st * (D+d) then F_k' = F_k' ∪ {X}; 
    if F_k' has at least 2 frequent schema trees then { k++; continue; } 
    else break; 
return {F_k' (k>1)}.                 /* end of the second step */ 

The last step:                       /* finding all maximally frequent schema trees */ 
TE' = {F_1', F_2', ..., F_k'}; 
for all X ∈ TE' do 
    for all Y ∈ TE' and X ≠ Y do { 
        if X contains Y then TE' = TE' - {Y}; 
        if X is contained by Y then TE' = TE' - {X}; 
    } □ 



4 Experimental Results 

In this section we present some preliminary experimental results. While the 
performance (in terms of time) is an important consideration in our work, the 
main focus of the experiments is the quality of the results. To test the efficiency 
of the proposed incremental mining algorithm, we ran some experiments with 
real-life data downloaded from the Web. Similar to [9], we chose the Internet 
Movie Database (IMDb) at http://us.imdb.com as our experimental data source. 
Our experiments were conducted on a PC platform with a Pentium II 200M CPU, 
64M RAM, and Windows NT OS. The results are illustrated in Fig. 2 and Fig. 3. 
Here, speedup ratio means the time ratio of our new incremental method to the 
non-incremental Apriori-like method, such as K. Wang's algorithm in [9]. Fig. 2 
shows the relation between speedup ratio and incremental data size. As the 
incremental size grows, the speedup ratio decreases. The effect of the support 
threshold on the speedup ratio is illustrated in Fig. 3, which shows that increasing 
the support threshold degrades the efficiency of the incremental mining 
algorithm. Overall, the incremental mining approach outperforms the 
non-incremental mining method. In our best experimental results, the speedup 
ratio of the incremental mining approach reaches over one order of magnitude. 



5 Conclusions 

In this paper, we present a method for extracting schema from semistructured 
data. Clearly, statically mining the semistructured data is not enough, since 
on-line data (e.g., on the Web or in a digital library) is generally updated 
dynamically. Such a situation needs an incremental mining solution. An efficient 
algorithm for incremental mining of schema for semistructured data is developed, 
and some experimental results are given. The experimental results show that the 
incremental mining approach outperforms the non-incremental method 
considerably. 

Mining complex-structured data is still an immature field compared to other 
fields due to the complexity and irregularity of the data structure. Mining schema 
for semistructured data is only a small step in this direction. Many open 
problems, such as mining Internet documents and mining Web multimedia, are 
waiting for sophisticated solutions. 

Fig. 2. Speedup Ratio vs. Incremental Size (Original data size is 10000, minsup=15%) 

Fig. 3. Speedup Ratio vs. Support (Original data size is 10000, increment=1000) 

Abstract. Querying a database for document retrieval is often a process 
close to querying an answering expert system. In this work, we apply 
knowledge discovery techniques to build an information retrieval system 
by regarding the structural document database as the expertise for 
knowledge discovery. In order to elicit the knowledge embedded in the 
document structure, a new knowledge representation, named Structural 
Documents (SD), is defined, and a transformation process which can 
transform the documents into a set of SDs is proposed. To evaluate the 
performance of our idea, we developed an intelligent information 
retrieval system which can help users to retrieve the required personnel 
regulations in Taiwan. In our experiments, the retrieval results using SD 
are better than those of traditional approaches. 



1 Introduction 

Nowadays, database systems are useful in business and industrial environments 
due to their multiple applications. Many database systems are built for storing 
documents, called document databases, and deserve more attention. Especially 
in professional fields, reference manuals are usually preserved as document 
databases in order to make querying more convenient. A document database 
consists of a large number of electronic books. In addition to the electronic form 
of the content stored in the database, some structural information, i.e., the 
chapter/section/paragraph hierarchy, may also be embedded in the database. 
Classical information retrieval usually allows little structuring [3] since it 
retrieves information only on data. Therefore, the structural information cannot 
be retrieved by classical information retrieval methods. However, the structural 
information is sometimes useful in querying a document database; for instance, 
most people read books with a chapter-oriented concept. To build an intelligent 
information retrieval system, our idea is to regard the document structure, 
including the index and the table of contents, as the expertise for knowledge 
discovery. In order to elicit the knowledge embedded in the document structure, 
we first propose a new knowledge representation, named Structural Documents 
(SD), as the basis of our system. Second, we design a knowledge discovery 
process that transforms the documents into a set of structural documents by 
merging two documents with similarity greater than a given threshold into one 
structural document. To evaluate the performance of our idea, we developed a 
system, named CPRCS, for Chinese personnel regulations in Taiwan. Our system 
may help users to retrieve the required personnel regulations. As mentioned 
above, a transformation process from the raw data to the format of a database is 
first applied. The embedded knowledge of the resulting data is then elicited by 
applying clustering techniques. In this way, the semantic indices of the raw data 
can be established, and suitable results may be obtained. In our experiments, the 
structural information of the documents can be acquired from the database using 
the knowledge extraction module. By observing how users operate the system, 
we found that the users' query process is simplified by using SD. 

2 Knowledge Discovery for Structural Documents 

In this section, we describe the flow of the knowledge discovery for structural 
documents (KDSD) as shown in Figure 1. Two kinds of existing resources, the 
index and the table of contents of the books, are used as the knowledge source in 
the KDSD. The content of the document database is first transformed into sets of 
words by the partition and transformation process. The similarity of each pair of 
documents is then computed to form the similarity matrix for documents, which 
can be used to find similar pairs of documents by some clustering method. 
After clustering, the hierarchy of structural documents can be obtained. 

In the following, the three procedures in the KDSD are described in detail. 

(1) Partition and transformation 

KDSD is capable of processing English or Chinese document databases. Since 
there is no obvious word boundary in Chinese text, a process for identifying each 
possible disyllabic word in the target database is needed. After the disyllabic 
words are identified in the sentence by applying the association measure [2, 4], 
each document can be transformed into a set of keywords. It should be noted 
that there is no need to identify words in English text, so the segmentation 
method is skipped for English text. In our approach, the indexes of the book are 
also used to identify the set of keywords for each document. Therefore, the sizes 
of the two sets vary, because the words are identified by the association measure 
and the keywords are identified by comparison with the index of the reference 
books. 

(2) Similarity measure 

To measure the similarity between two documents, we use the following
heuristics: the similarity between two documents in the same chapter is higher
than that between documents in two different chapters, and the similarity
between two documents in the same section is higher than that between documents
in two different sections. Without loss of generality, assume the whole
reference book is divided into a three-tier hierarchy of chapter, section, and
paragraph. Based upon the above heuristics, the HierarchyDependence (HD)
between two documents can be easily computed.

Fig. 1. Flowchart of the knowledge discovery for structural documents.

Let the two documents be denoted as $D_i$ and $D_j$. In KDSD, the similarity of
$D_i$ and $D_j$, denoted by $S(i,j)$, is computed by the following formula:

$S(i,j) = (1 - \delta) \cdot same(i,j) + \delta \cdot HD(i,j)$, (1)

where:

- $same(i,j)$ is the number of words and keywords that appear in both
documents $D_i$ and $D_j$. The value is normalized by dividing by the total
number of keywords in documents $D_i$ and $D_j$.

- $HD(i,j)$ is the hierarchy dependence of documents $D_i$ and $D_j$.

- $\delta$ is an adaptive weight value, with $0 < \delta < 1$.

The default value of $\delta$ is 0.5, which can be adjusted according to the
number of chapters in a given book. For example, if $n$ documents are obtained
after the partitioning step, we set $\delta$ close to 0 when the number of
chapters is near 1 or $n$, because in that case the structure of the book
carries little meaning. By computing the similarity measure for every pair of
distinct documents $D_i$ and $D_j$, a similarity matrix $[S_{ij}]$,
$S_{ij} = S(i,j)$, can be formed. To simplify further discussion, let the
matrix be upper triangular with zero diagonal elements by assigning
$S_{ij} = 0$ for $i \ge j$.
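The construction of the similarity matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not give a formula for HD, so `hierarchy_dependence` below uses hypothetical scores (1.0 for the same section, 0.5 for the same chapter only, 0.0 otherwise), and `same` reads the normalization as division by the combined keyword count of the two documents. The function names and the `(chapter, section)` position tuples are introduced here for illustration.

```python
from itertools import combinations


def same(words_i, words_j):
    """Shared words/keywords, normalized by the combined keyword count."""
    total = len(words_i) + len(words_j)
    return len(words_i & words_j) / total if total else 0.0


def hierarchy_dependence(pos_i, pos_j):
    """Hypothetical HD score; positions are (chapter, section) tuples."""
    if pos_i == pos_j:            # same chapter and same section
        return 1.0
    if pos_i[0] == pos_j[0]:      # same chapter, different section
        return 0.5
    return 0.0


def similarity_matrix(docs, positions, delta=0.5):
    """Upper triangular matrix with S[i][j] = 0 for i >= j, following Eq. (1)."""
    n = len(docs)
    S = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):   # yields only pairs with i < j
        S[i][j] = ((1 - delta) * same(docs[i], docs[j])
                   + delta * hierarchy_dependence(positions[i], positions[j]))
    return S
```

For instance, `similarity_matrix([{"a", "b"}, {"b", "c"}], [(1, 1), (1, 1)])` fills only the entry above the diagonal and leaves the rest at zero.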
(3) Clustering and structuring documents

Clustering is the step of KDSD that structures the documents.
To transform the documents into a set of structural documents, we modify
the similarity matrix updating algorithm for hierarchical clustering proposed by
Johnson (1967) [1]. Hierarchical clustering approaches, including the
single-link and complete-link methods, differ in how they measure the
similarity of clusters. The hierarchy generated by the complete-link method
appears to be more balanced than the one generated by the single-link method.
Our modified algorithm therefore implements the complete-link method. The
definition and notation of SD below formally represent the knowledge
used in the clustering process:

Definition 2.1 The Structural Document with level $l$ is defined recursively as
follows: (a) A document $D_i$ is a Structural Document with level 0, denoted as
$SD_i$, which can also be denoted as $(D_i)_0$. (b) A pair of Structural
Documents, denoted as $(SD_i, SD_j)_l$, with the greatest similarity measure
among all different pairs, where that measure is greater than a threshold
$\theta$, is also a Structural Document with level $l$, where
$l = Maximum(m, n) + 1$, $m$ is the level number of $SD_i$, and $n$
is the level number of $SD_j$.

Definition 2.2 (Similarity of Structural Documents) The similarity $S'$
between the structural documents $SD_i$ and $SD_j$ is defined as follows: (a) If
$SD_i = (D_i)$ and $SD_j = (D_j)$, the similarity $S'(SD_i, SD_j) = S(i,j)$.
(b) The similarity between $(SD_i, SD_j)$ and $SD_k$ is defined as
$S'(SD_k, (SD_i, SD_j)) = \odot \{S'(SD_k, SD_i), S'(SD_k, SD_j)\}$, where the
operator $\odot$ means 'minimum' for the complete-link method, or 'maximum' for
the single-link method.

Assume we have $n$ documents, so that an $n \times n$ similarity matrix is
generated by our method. First, each document is assigned to be a structural
document, i.e. $SD_i = (D_i)$ for document $D_i$. The similarity matrix for
documents then becomes the initial similarity matrix for structural documents:
the elements of the similarity matrix $[S'_{ij}]$ are the similarity measures
of structural documents $SD_i$ and $SD_j$, that is, $S'_{ij} = S_{ij}$. Let
$T = \{SD_i\}$ be the set of all $SD_i$. Based upon Johnson's algorithm (1967)
for hierarchical clustering [1], we propose the following procedure for
similarity matrix updating:

Step 1: Find the most similar pair of structural documents in the current
similarity matrix, say pair $\{p, q\}$, where
$S'_{p,q} = Maximum\{S'_{i,j}$, for any $i, j\}$.

Step 2: Merge structural documents $SD_p$ and $SD_q$ into a new structural
document $(SD_p, SD_q)$.

Step 3: Delete the structural documents $SD_p$ and $SD_q$ from the set $T$, and
insert the new structural document $(SD_p, SD_q)$ into the set, i.e.,
$T \leftarrow T - \{SD_p\} - \{SD_q\} + \{(SD_p, SD_q)_l\}$, where
$l = Maximum(m, n) + 1$, $m$ is the level number of $SD_p$, and $n$ is the
level number of $SD_q$.

Step 4: Update the similarity matrix by deleting the rows and columns related
to structural documents $SD_p$ and $SD_q$ and adding a row and a column
corresponding to the new structural document $(SD_p, SD_q)$.

Step 5: If there are no two structural documents with similarity greater than
$\theta$, stop. Otherwise, go to Step 1.

After clustering, the hierarchy of structural documents can be obtained
from the set $T$.
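The updating procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: structural documents are represented as nested tuples paired with their level, similarities are kept in a dictionary keyed by unordered pairs rather than in a matrix, and the threshold test is performed before each merge, which is equivalent to checking it in Step 5. The names `cluster`, `theta` and `link` are assumptions of this sketch; passing `link=min` gives complete-link and `link=max` gives single-link, as in Definition 2.2.

```python
def cluster(S, theta, link=min):
    """Merge structural documents until no pair has similarity above theta.

    S is the upper triangular document similarity matrix (S[i][j], i < j).
    Returns the final set as a list of (tree, level) pairs.
    """
    n = len(S)
    sds = {i: (i, 0) for i in range(n)}            # leaves are level-0 documents
    sim = {frozenset((i, j)): S[i][j]
           for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(sds) > 1:
        pair, best = max(sim.items(), key=lambda kv: kv[1])   # Step 1
        if best <= theta:                                      # Step 5
            break
        p, q = sorted(pair)
        (tree_p, lvl_p), (tree_q, lvl_q) = sds.pop(p), sds.pop(q)
        for k in sds:                                          # Step 4
            sim[frozenset((k, next_id))] = link(
                sim.pop(frozenset((k, p))), sim.pop(frozenset((k, q))))
        del sim[pair]
        # Steps 2 and 3: the merged pair gets level max(m, n) + 1.
        sds[next_id] = ((tree_p, tree_q), max(lvl_p, lvl_q) + 1)
        next_id += 1
    return list(sds.values())
```

Ties between equally similar pairs are broken by insertion order here; the paper does not specify a tie-breaking rule.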

3 Experiments 

A sound legal system and complete regulations are of great
importance to a government ruled by law. At present, the personnel regulations
(PR) for civil servants in Taiwan are very complicated. Although several kinds
of reference books about personnel regulations are available for the general
public to consult, they are not easy to use and looking things up takes a lot
of time. Improving the methods of inquiry and annotation of the personnel
regulations has therefore become an important issue. Over the past two years,
we implemented a prototype database based on SD, named CPRCS (Chinese
Personnel Regulations Consultation System), to assist the querying process for
PR documents. In CPRCS, users access the system through a web page
interface. Two modes, with or without SD, are provided, and users choose the
one they prefer. Eight groups, each including two or three users, tested the
system to evaluate the performance of SD. The average accuracy rate for each
group is defined as the number of retrieved documents that meet the user's
demand divided by the total number of retrieved results. Table 1 shows the
accuracy rate of the retrieval results with and without SD. The retrieval
results using SD are better for every group, with an average accuracy of .834
against .703 without SD.

Table 1. Accuracy rate of retrieval results with or without SD.

Group's ID    1     2     3     4     5     6     7     8     Average
With SD       .87   .73   .84   .82   .91   .78   .89   .83   .834
Without SD    .78   .65   .81   .58   .52   .67   .85   .76   .703
Abstract. More and more crucial and commercially
valuable information is becoming available on the World Wide Web, and
financial services companies are making their products increasingly
available on the Web. This paper proposes and investigates six methods
for combining multiple categorical predictions that are generated from
distributed information available on the World Wide Web.



1 Introduction 

There are various types of financial information sources on the Web (www.wsj.com,
www.ft.com, www.bloomberg.com, etc.) providing real-time news and quotations of
stocks, bonds and currencies. We make use of these distributed textual sources to
generate predictions about the world's major stock market indices. Probably the
most important factor influencing the prediction accuracy is the selection of the data
sources. If the quality of the data source or information is low, then no meaningful
predictions can be expected. This comes as no surprise: analyzing bad data leads to
bad results (i.e., GIGO, Garbage In Garbage Out). In this paper, we investigate the
natural approach of trying to exploit all relevant data sources. From each data source a
forecast is produced, and finally a consensus prediction is generated.

Our application is to predict the Hang Seng Index (HSI), Hong Kong's major stock
market index. If the closing value of HSI moves up by at least x% relative to the
previous day's closing value, then the classification is up. If the closing value of
HSI slides down by at least x% relative to the previous day's closing value, then the
classification is down. Otherwise, when HSI moves by less than x% in either
direction, it is steady. For the chosen training and test period, we set x to 0.5,
as this makes each of the possible outcomes
up, down and steady about equally likely.
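The three-way labelling rule can be written down directly. The sketch below is an illustration only; the function name `classify_move` and its argument names are assumptions, while the default threshold x = 0.5 (percent) is the value stated in the text.

```python
def classify_move(prev_close, close, x=0.5):
    """Label a trading day as 'up', 'down' or 'steady' from two closing values."""
    change_pct = (close - prev_close) / prev_close * 100.0
    if change_pct >= x:        # rose by at least x percent
        return "up"
    if change_pct <= -x:       # fell by at least x percent
        return "down"
    return "steady"
```

For example, a close of 10100 after a previous close of 10000 is a 1% rise and is labelled up, whereas a close of 10020 is a 0.2% rise and is labelled steady.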

2 Framework 

Suppose a set of training examples or time points $T = \{t_1, t_2, \ldots, t_m\}$.
Each tuple $t_i$ belongs to exactly one of the classes
$C = \{c_1, c_2, \ldots, c_k\}$. The actual classification of tuple $t_i$ is
denoted $a_i$, such that $a_i \in C$. There is a set
$S = \{s_1, s_2, \ldots, s_n\}$ of data sources. From each source $s_j$ we
generate the classification $s_{ij}$, where $s_{ij} \in C$, for tuple $t_i$. The
problem is to generate the consensus prediction $p_i$, where $p_i \in C$, for
$t_i$. The overall situation is depicted in Fig. 1.

Combining Forecasts from Multiple Textual Data Sources

N. Zhong and L. Zhou (Eds.): PAKDD'99, LNAI 1574, pp. 174-179, 1999.

© Springer-Verlag Berlin Heidelberg 1999





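[Figure: table with one row per tuple $t_i$, one column per source
$s_1, \ldots, s_j, \ldots, s_n$ holding the individual classification $s_{ij}$,
and two further columns for the consensus prediction $p_i$ and the actual
class $a_i$.]

Fig. 1. Individual and consensus classification for the tuples $\{t_1, t_2, \ldots, t_m\}$.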



The generation of the consensus forecast $p_i$ takes into account all individual
forecasts $s_{ij}$ plus three kinds of meta information about each data source and classifier
pair: the quality of the data source (this may be an objective or a subjective
measurement), the historic classification accuracy, and prior probability information.

Zhang et al. introduced a notion of data quality which is objective. That is, the
quality of a data source is computed from the content of the source alone without 
human interaction. Data quality could also be a subjectively obtained measurement. In 
our experiments the mentioned objective data quality notion is used. Let qj denote the 
quality of a data source Sj. The quality is a real number within the range of zero to one, 
one meaning highest quality. 

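The accuracy of a source is the percentage of correct predictions over a period of
time. For a set of training examples $T$, the accuracy $a(s_j)$ of a source $s_j$ is as follows.

$a(s_j) = \frac{\sum_{i=1}^{m} \delta(s_{ij}, a_i)}{m}$ (1)

where

$\delta(s_{ij}, c) = \begin{cases} 1 & \text{if } s_{ij} = c \\ 0 & \text{otherwise} \end{cases}$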



Prior probability information contains the historic prediction details of a source $s_j$.
It is a $k \times k$ matrix of conditional probabilities derived from the training examples. Let
$Pr_j(c_h | c_l)$ denote the probability that, when source $s_j$ classifies into $c_l$, the
actual outcome is $c_h$.

We predict daily movements of the Hang Seng Index. From each data source a
prediction is generated: it moves up, remains steady or goes down. In our experiments
we take a training set of one hundred days. The data quality, accuracy and prior
probability information of a source $s_1$ is provided in Fig. 2.



Source 1
Quality = 0.3

                        Prediction
Actual outcome    Up    Steady    Down
Up                30    1         3
Steady            11    10        9
Down              4     4         28
Sum               45    15        40

$a(s_1) = (30 + 10 + 28) / 100 = 0.68$

Fig. 2. Meta information about source $s_1$

Our objective is to combine predictions from individual sources to form an overall 
categorical prediction. 
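The meta information for one source can be derived from a table of counts such as the one in Fig. 2. The sketch below is illustrative: the counts for the Up and Down rows, the column sums and the accuracy of 0.68 come from the figure, while the Steady row (11, 10, 9) is inferred here from the column sums and the diagonal term in the accuracy calculation. The function names and the dictionary layout `counts[actual][predicted]` are assumptions of this sketch.

```python
CLASSES = ["up", "steady", "down"]

# counts[actual][predicted] over the 100 training days of source 1
COUNTS = {
    "up":     {"up": 30, "steady": 1,  "down": 3},
    "steady": {"up": 11, "steady": 10, "down": 9},   # inferred row
    "down":   {"up": 4,  "steady": 4,  "down": 28},
}


def accuracy(counts):
    """Share of days on which the prediction equals the actual outcome."""
    total = sum(sum(row.values()) for row in counts.values())
    correct = sum(counts[c][c] for c in counts)
    return correct / total


def conditional_probabilities(counts):
    """probs[l][h] estimates Pr(actual = h | predicted = l)."""
    probs = {}
    for l in CLASSES:
        column_sum = sum(counts[h][l] for h in CLASSES)
        probs[l] = {h: (counts[h][l] / column_sum if column_sum else 0.0)
                    for h in CLASSES}
    return probs
```

With these counts, `accuracy(COUNTS)` reproduces the 0.68 shown in Fig. 2, and `conditional_probabilities(COUNTS)["up"]["up"]` is 30/45.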
3 Consensus Prediction

3.1 Best Quality

Select the prediction $s_{ij}$ if the source $s_j$ has the highest quality among all sources.

$p_i = s_{ij}$ iff $q_j = \max\{q_1, q_2, \ldots, q_n\}$ (2)



3.2 Majority Voting 

Select as consensus prediction the class $c_h$ that has the largest number of sources predicting it.



3.3 Voting Weighted by Quality 

This is a refinement of majority voting. Sources with higher quality have more
importance in the vote. Let $V_i(c)$ denote the votes for class $c$ when classifying case $t_i$.

$V_i(c) = \sum_j q_j \times \delta(s_{ij}, c)$ (3)

The consensus forecast $p_i$ is class $c$ exactly when $c$ gets the highest vote among all
possible outcomes.
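The first three schemes (best quality, majority voting, and voting weighted by quality) can be sketched together. This is a minimal illustration under assumptions: `preds` is the list of individual predictions for one tuple, `quality` is the list of quality values of the corresponding sources, and the function names are introduced here. The source does not say how ties are broken; this sketch simply keeps the first maximum encountered.

```python
from collections import Counter, defaultdict


def best_quality(preds, quality):
    """Eq. (2): take the prediction of the source with the highest quality."""
    j = max(range(len(preds)), key=lambda idx: quality[idx])
    return preds[j]


def majority_vote(preds):
    """Pick the class predicted by the largest number of sources."""
    return Counter(preds).most_common(1)[0][0]


def weighted_vote(preds, quality):
    """Eq. (3): each source votes for its class with weight equal to its quality."""
    votes = defaultdict(float)
    for pred, q in zip(preds, quality):
        votes[pred] += q
    return max(votes, key=votes.get)
```

With predictions `["up", "down", "down", "steady"]` and qualities `[0.9, 0.3, 0.2, 0.1]`, majority voting selects down, whereas both best quality and weighted voting select up, which shows how the schemes can disagree on the same inputs.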



3.4 Conditional Probability Selection 

Recall that $Pr_j(c_h | c_l)$ denotes the probability that when source $s_j$ classifies into $c_l$, the
actual outcome is $c_h$. The probability $P_i(c)$ of each class is computed using these
conditional probabilities. The class with the highest such probability is then the consensus
classification for example $t_i$.

$P_i(c) = \sum_j Pr_j(c | s_{ij}) / n$ (4)



3.5 Accuracy Selection 

The accuracy $a(s_j)$ of a source $s_j$ reflects the average historic prediction accuracy over
all categories and is used to estimate the parameters of the $k \times k$ prior conditional
probability matrix.

$Pr_j(c | s_{ij}) = \begin{cases} a(s_j) & \text{if } s_{ij} = c \\ \frac{1 - a(s_j)}{k - 1} & \text{otherwise} \end{cases}$ (5)

If a source predicts a class $c$, then the likelihood for the final outcome to be $c$ is
estimated by its accuracy $a(s_j)$. On the other hand, the likelihood of the final
outcome being any class other than $c$ is estimated to be $(1 - a(s_j))/(k - 1)$, where $k$ is the
total number of categories. For example, suppose up is predicted by a source with
accuracy 40%; then the likelihood that the outcome is down and the likelihood that it is
steady are both set to 30% ((100% - 40%)/2).

The probability $P_i(c)$ of each class is computed using these conditional
probabilities. The class with the highest such cumulative likelihood is then the consensus.

$P_i(c) = \sum_j Pr_j(c | s_{ij}) / n$ (6)
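Conditional probability selection and accuracy selection share the same averaging step and differ only in where the conditional probabilities come from. The sketch below illustrates this under assumptions: each source's matrix is a nested dictionary `matrix[predicted][actual]`, and the function names are introduced here. `accuracy_matrix` builds the matrix estimated from a single accuracy value; a matrix estimated from training counts could be passed to `probability_consensus` in the same way.

```python
def accuracy_matrix(acc, classes):
    """Accuracy selection: acc on the diagonal, (1 - acc) / (k - 1) elsewhere."""
    k = len(classes)
    return {pred: {c: (acc if c == pred else (1 - acc) / (k - 1))
                   for c in classes}
            for pred in classes}


def probability_consensus(preds, matrices, classes):
    """Average Pr_j(c | s_ij) over the n sources and pick the best class."""
    n = len(preds)
    scores = {c: sum(m[pred][c] for pred, m in zip(preds, matrices)) / n
              for c in classes}
    return max(scores, key=scores.get), scores
```

As a check against the worked example in the text, a source with accuracy 40% that predicts up assigns about 0.3 to each of down and steady.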



3.6 Meta Classifier 

From the training examples the prediction matrix shown in Fig. 3 is produced.
Consider the tuple $t_i$: the number of sources predicting class $c_l$ is
$v_{il} = \sum_j \delta(s_{ij}, c_l)$.

Fig. 3. Prediction matrix

This matrix is itself considered as training data. Each row constitutes one example,
where $a_i$ is the classification. This data is used to again generate decision rules. Upon
receiving new predictions from individual sources, the decision rules then determine
the overall or consensus prediction.
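Building the prediction matrix amounts to counting, for every training tuple, how many sources voted for each class. The following sketch shows only that step; the rule learner that is then trained on the rows is not described in enough detail here to reproduce, so it is left out. The function name and the row layout (a list of vote counts paired with the actual class) are assumptions of this sketch.

```python
def prediction_matrix(source_preds, actual, classes):
    """One row per tuple: vote counts per class, plus the actual class.

    source_preds[i] is the list of individual predictions s_ij for tuple t_i,
    and actual[i] is its actual class a_i.
    """
    rows = []
    for preds_i, a_i in zip(source_preds, actual):
        votes = [sum(1 for s in preds_i if s == c) for c in classes]
        rows.append((votes, a_i))
    return rows
```

Each row can then serve as one training example for any classifier that accepts numeric features and a class label.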



4 Experiments 

Thousands of pages are updated each day, and each page contains valuable
information about stock markets. The address of a given piece of information is usually fixed.
For example, the page www.wsj.com/edition.current/articles/hongkong.htm contains
news about Hong Kong's stock market. This page is considered as one data
source on the Internet. However, some news page addresses change from time to time;
usually these pages sit under a content page. For example, the page
www.ft.com/hippocampus/ftstats.htm is the content page for various regional financial
news pages, but the individual news pages under this content page have changing
URL addresses. The fixed content page together with its underlying news pages is
therefore considered as one identifiable information source. Among the numerous financial
news Web sites on the Internet, we restricted ourselves to five financial information
networks, the ones named above. These networks provide fast and accurate
information. From those networks we selected forty-one Web sources which we
consider, using our common sense and understanding of stock markets, to be
relevant to the task of predicting Hong Kong's stock market index, the Hang Seng
Index. We collected these data in the period 14 Feb. 1997 to 6 Nov. 1997, which
provides a total of 179 stock trading days. The forecasting is done as described in
Wüthrich et al. and Cho et al. The first 100 trading days are used for training
and estimation of the quality, accuracy and conditional probabilities. The
remaining 79 days are used to test the performance of the six consensus forecasting
methods.

A good yardstick for consensus forecasting is the accuracy, i.e. what percentage
of the predictions is correct. For instance, if the system predicts up and the index
indeed moves up, then the prediction is correct. The accuracy is shown in the second
column of Table 1. The third column in Table 1 indicates how often the system predicts up
or down when the index was actually steady, or predicts steady when it was actually
up or down. The last column indicates the percentage of totally wrong predictions,
that is, the system expects the index to go up and it moves down, or vice versa.



Table 1. Performance on the 79 trading days

Method                   Accuracy   Slightly wrong   Wrong
Best Quality Source      30.4%      62.0%            7.6%
Majority Voting          51.9%      29.1%            19.0%
Weighted Majority        38.0%      22.8%            39.2%
Cond. Prob. Selection    39.2%      60.0%            3.8%
Accuracy Selection       44.3%      36.7%            19.0%
Meta Classifier          44.3%      22.8%            32.9%



Among the six consensus schemes, majority voting, with an accuracy above 50%,
outperforms the others. Moreover, its percentage of wrong classifications is
rather low, 19%. The second best method is accuracy selection with 44.3% accuracy
and 19% wrong classifications. The other methods, except probabilistic rules, are
clearly lagging behind. There is a 99.9% probability that an accuracy of 51.9% on 79
test cases cannot be achieved by random guessing.
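The text does not say how the 99.9% figure was obtained. One plausible check, sketched here under two assumptions that are not stated in the source, is a binomial tail probability: random guessing is taken to pick each of the three classes with probability 1/3, and 51.9% of 79 days is read as 41 correct predictions (41/79 is about 0.519). The function name `binomial_tail` is introduced for illustration.

```python
from math import comb


def binomial_tail(n, k, p):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance that uniform random guessing gets 41 or more of 79 days right.
p_random = binomial_tail(79, 41, 1 / 3)
```

Under these assumptions `p_random` comes out below 0.001, which is in line with the 99.9% statement.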

Fig. 4 confirms the conjecture that the prediction accuracy of individual sources is
lower than the accuracy achieved by taking more complete information into account and
combining it appropriately.



[Figure: plot of per-source performance; legend entries "Correct Accuracy (Majority
Voting)" and "Median Slightly Wrong (Majority Voting)".]

Fig. 4. Performance of individual sources compared with majority voting
5 Conclusions

We investigated six methods for making consensus predictions from the individual
predictions of classifiers. We showed that consensus decisions lead to higher accuracy
than that achieved even by the best quality source. In particular,
majority voting proved to be a simple but remarkably effective way to produce the
consensus.
Abstract. This paper presents the method of domain knowledge extraction in
NChiql, a Chinese natural language interface to databases. After describing the
overall extraction strategy in NChiql, we mainly discuss the basic semantic
information extraction method, called DSE. A semantic conceptual graph is
employed to specify two types of modification and three types of verbal
relationship among the entities, relationships and attributes. Compared with
related work, DSE has a stronger extraction ability.



1 Introduction 

A natural language interface to databases (NLIDB) is a system that allows the user
to access information stored in a database by typing requests expressed in some
natural language (e.g. Chinese, English). Since the early 1960s, much of the research
on NLIDBs has been motivated by their potential use for communicating with DBMSs.
Two main issues currently hinder NLIDBs from gaining rapid and wide commercial
acceptance: portability and usability [1]. These problems result from the system's
poor ability to cover the domain knowledge of a given application.

Each application domain has its own vocabulary and domain knowledge. Such
uncertainty of domain-dependent knowledge reduces the usability of NLIDBs across
different application domains. Therefore, when a system is transported from one
domain to another, it must be able to obtain new words and new knowledge. To
solve the problem of portability, the system must be able to extract domain
knowledge automatically, or at least semi-automatically, in order to build the domain
dictionaries for the next step of natural language processing.

Domain knowledge extraction is the process of acquiring sufficient domain
knowledge and reconstructing domain information, especially the semantics and
usage of words, in order to extend the ability of language processing. Knowledge
extraction has recently received wide attention. Research in
computational linguistics has proposed some corpus-based extraction methods.
Techniques for domain knowledge extraction in NLIDBs are mainly based on the database
schema. These methods focus on simple extraction of database structure
information, whereas the potential extraction of semantic knowledge is not fully
considered.

We think that the extraction of domain knowledge in NLIDBs must not only
reconstruct the information framework of the database and extract the semantic and
usage information in the database schema, but also extract semantic and usage
information based on the user's conceptual schema. Because most of this information
is not reflected in the existing database schema and is part of the user's own
knowledge, it has to be extracted by means of a suitable learning mechanism.



2 Domain Knowledge’s Classification and Extracting Strategies 

Generally, the domain knowledge of NLIDBs can be classified into two types:

• Domain-Independent Knowledge, e.g. the usage of some common words, such
as conjunctions and prepositions, all of which are independent of the specific database
application domain; and

• Domain-Dependent Knowledge, which depends on the specific application
domain.

Extraction of domain knowledge means bridging the gap between the real-world
knowledge model and the database schema. Its goal is to provide the necessary
grammatical and semantic information for language processing in NChiql. Generally,
domain-dependent knowledge falls into two categories:

1) Basic semantic information, which is contained in the database schema. The
database logical model conveys some basic semantic information: entities and
relationships are expressed by tables, and their names are expressed by symbol names;
the primary keys and foreign keys describe the mutual reference relationships between
entities, while a relationship embodies how the entities interact and what roles
they play; the constraint information describes the restrictions that apply when an
attribute is assigned a value.

2) Deductive semantic information, which extends the basic information and exists
in the application domain but is omitted when the database schema is created. Generally,
such knowledge can be computed and deduced from the basic information. For
example, if the database stores the latitude and longitude of different
cities, then the distance between any two cities is deductive information.
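The latitude and longitude example can be made concrete with the standard haversine great-circle formula. The paper does not say how such a distance would be deduced; the sketch below is one common way to do it, with the function name, the argument order and the mean Earth radius of 6371 km all being assumptions of this illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def city_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # min() guards against rounding pushing the square root above 1.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))
```

A query such as "how far is city A from city B" could then be answered from the stored coordinates even though no distance column exists in the schema.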

Because basic semantic information depends on the relatively static database
schema, the first phase of our extraction is DSE (Database Schema Extraction).
Generally, DSE can extract the following grammatical and semantic information:
entities, attributes, and relationships (the verb usage of entity names).

However, the design of a database schema cannot be controlled, and some schemas are
not normalized (for example, they lack basic primary key constraints), so
database schemas may conform to different normal forms. Because our extraction method is
mainly based on heuristic rules, such non-normalized schemas give rise to
great difficulties.

Deductive semantic information exists beyond the database schema of an
application. Because deductive information is complex, its extraction
must employ different techniques to capture as many deductive
concepts as possible. We propose the following hybrid extractor for
deductive information:

• Corpus-based extraction acquires more grammatical and semantic information by
collecting and analyzing a relevant query corpus. For example, "live" is a
verb that is not contained in the basic semantic information; it is the verb connecting
the entity "student" and the attribute "address". Such potential words can be obtained
only from a large corpus.

• Data-mining-based extraction: the database contains abundant semantic
knowledge, which can be acquired by data mining.

• Learn-by-query extraction: while communicating with the user, the system
continually learns new knowledge and enriches its domain knowledge.



3 Database Schema Based Extracting in NChiql 

NChiql is a Natural Chinese Query Language interface to databases. In the process of
extraction, NChiql extracts the essential information about entities, relationships and
attributes in databases by analyzing the relationships among these database objects.
The procedure of DSE in NChiql consists of the following steps:

(1) Get the database schema information through the data dictionary;

(2) Check whether each object is an Entity or a Relationship;

(3) If it is an Entity, extract the semantics of the entity;

(4) If it is a Relationship, extract the semantics of the relationship;

(5) Extract the information about the attributes of the entity or relationship.

The extraction is mainly based on heuristic rules. For example, our identification
of entities and relationships is based on the following rules:

In a database schema $R(R_1, R_2, \ldots, R_n)$, if in $R_i$ there exists the primary key
$k_i$ $(A_1, A_2, \ldots, A_m)$, $m > 0$,

(1) If $m = 1$, $R_i$ is an entity.

(2) If $m > 1$, and the primary key of $R_j$ is $k_j$, with $k_i \cap k_j \neq \emptyset$, then $R_i$ is a relationship;

otherwise $R_i$ is an entity.
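The entity/relationship rule can be sketched as a small function over primary keys. This is an illustration of one reading of the rule, not NChiql's implementation: a table with a composite primary key is treated as a relationship when its key shares at least one attribute with the primary key of some other table. The function name, the `"unknown"` label for tables without a primary key, and the assumption that shared key attributes carry the same name in both tables are all introduced here.

```python
def classify_tables(primary_keys):
    """Label each table as 'entity' or 'relationship' from its primary key.

    primary_keys maps a table name to the set of its primary-key attributes.
    """
    labels = {}
    for name, key in primary_keys.items():
        if not key:
            labels[name] = "unknown"          # rule requires m > 0
        elif len(key) == 1:
            labels[name] = "entity"           # rule (1): single-attribute key
        else:
            # rule (2): composite key overlapping another table's key
            overlaps = any(key & other_key
                           for other, other_key in primary_keys.items()
                           if other != name)
            labels[name] = "relationship" if overlaps else "entity"
    return labels
```

For a hypothetical schema with `student(sno)`, `course(cno)` and `enrol(sno, cno)`, the first two tables are labelled as entities and `enrol` as a relationship.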

All the results of extraction are stored in dictionaries. NChiql divides dictionaries
into two types, domain-independent and domain-dependent, namely
the general dictionary and the specific dictionary.

• The general dictionary is independent of the application and is the core of the system.
It records the semantics of the most commonly used words, e.g. pronouns,
conjunctions, quantifiers, interrogatives, routine words, etc.

• The specific dictionary records the semantics and usage of words that are usually
used in a specific application domain. When the system is transported to
another domain, it must be reconstructed. The specific dictionary includes:

(1) Dictionary of entity semantics, which records the names of tables,
synonyms and the semantics of the modifying attributes.

(2) Dictionary of attribute semantics, which records the attribute
names, synonyms, hints, modifiers and the semantics of constraints.

(3) Dictionary of verb semantics, which records the semantics and
usage of verbs.

(4) Domain knowledge rules.
We call the result of DSE the semantic conceptual model. It reflects not only
the semantics of words but also the combination relationships among different words, so
it is more powerful than the E-R model in terms of semantic expression. Figure 1
illustrates the conceptual model.

Fig. 1. Illustration of the semantic conceptual graph



The semantic conceptual graph takes semantic concepts (words) as its core and
emphasizes the semantic relationships among words. Generally, there are
two types of modifying relationship and three types of verb relationship. Specifically,
the types of modifying relationship are:

(1) Direct modifying relationships among entities, including
attribute-modify-entity and entity-modify-entity.

(2) Reference modifying among entities, which arises from foreign key
reference relationships among entities.

The three types of verb relationship are the verb combinations of relationships
and entities, attributes and entities, and entities and entities.

Modifying relationships mainly determine the combination relationships of noun
phrases (NP), while verb relationships present the usage of verb phrases (VP). Having
these semantic relationships available is extremely useful when analyzing
natural language queries.

When an NLIDB transforms natural language into a database language, it needs not
only natural language knowledge but also relevant database semantic knowledge, i.e.
the mapping information from natural language to database semantics. NChiql
differs from other extraction methods in that the usage of words is represented by
the database semantics: the composition of words in a sentence and the
relationships between words are directly mapped to the tables and columns of the
database logical model. Such an extraction method is word-driven and is based on the
semantic and functional characteristics of words.

In practice, because database designers differ in skill, many
database schemas do not satisfy the assumptions of DSE. To
handle these non-normalized cases and reduce the burden on the user as far as
possible, we propose some solutions, such as determining the primary key by reverse
engineering [2] and handling non-normalized schemas by building views.



4 Related Works 

TED [3] is one of the earliest prototype systems dedicated to the problem of
transportability of NLIDBs. An important characteristic of TED is that it provided an
automatic extraction interface, the Automated Interface Expert, through which the user
could guide the system to learn the language and logical knowledge of the specific
domain. The extraction ability of TED is very limited. Compared with TED, the
extraction technology of TEAM [4] is relatively mature and complete.
The extraction method of TEAM is menu-driven and aided by interaction. In TEAM all
the database objects to be extracted, as well as the words extracted, are listed in the
menu, so the whole extraction structure is very clear.

However, neither of the above systems takes sufficient advantage of the information
contained in the database schema. Although TEAM includes extraction of
the primary key and foreign key, it does not analyze the relationships among entities
and attributes. It is therefore biased towards extracting the general grammar of the
words in a natural language category, which increases the complexity of extraction and,
in turn, the complexity of language processing.



5 Conclusion 

Semantic extraction is an indispensable part of natural language processing, and
the method of semantic extraction reflects the characteristics of the natural language
processing. By sufficiently extracting all sorts of information hidden in the
domain, we can reduce the ambiguity of words as far as possible and improve
the efficacy of natural language processing.

The remaining difficult problem is the extraction of deductive semantic information.
NChiql has a well-developed mechanism for extracting the basic semantic information,
but much further work is needed on the extraction of deductive semantic information.
Abstract. Data Mining delivers novel and useful knowledge from very
large collections of data. The task is often characterised as identifying
key areas within a very large dataset which have some importance or
are otherwise interesting to the data owners. We call this hot spots data
mining. Data mining projects usually begin with ill-defined goals expressed
vaguely in terms of making interesting discoveries. The actual
goals are refined and clarified as the process proceeds. Data mining is an
exploratory process where the goals may change and such changes may
impact the data space being explored. In this paper we introduce an approach
to data mining where the development of the goal itself is part
of the problem solving process. We propose an evolutionary approach to
hot spots data mining where both the measure of interestingness and the
descriptions of groups in the data are evolved under the influence of a
user guiding the system towards significant discoveries.



1 Introduction 

Data mining is an inherently iterative and interactive process. A fundamental 
concern is the discovery of useful and actionable knowledge that might be con- 
tained in the vast collections of data that most organisations today collect but 
usually can not effectively analyse. In applying data mining techniques in a num- 
ber of case studies with industrial collaborators (in health care, taxation, and 
insurance) we have developed the hot spots methodology for assisting in the task 
of identifying interesting discoveries (Williams and Huang 1997). 

The hot spots methodology originally entailed the use of clustering and rule 
induction techniques to identify candidate groups of interesting entities in very 
large datasets. These groups were evaluated to assess their interestingness (i.e., 
whether they represented useful and actionable discoveries for the particular
domain of application). In dealing with very large datasets the number of identified
candidate groups becomes very large and is no longer amenable to simple or
manual evaluation. The groups, described by symbolic rules, serve as a reasonable
starting point for the discovery of useful knowledge. However, it has been
found empirically that an exploration of other but related areas of the data
(e.g., nearby regions within the dataset), in concert with the domain user, leads
to further interesting (and sometimes much more interesting) discoveries. 

The focus of our data mining work is to support the domain user (fraud 
investigators, auditors, market analysts) in the task of focusing on interesting 
groups of a very large collection of data (many millions of records). The emphasis
on support is important for data mining as our experience suggests that domain 
expertise will remain crucial for successful data mining. While the hot spots 
methodology established a useful starting point, it provided only little support 
to proceed further in identifying the most interesting discoveries from amongst 
the many thousands that were being made and others that lay nearby in the 
vast search space. 

In this paper we develop an architecture for an evolutionary hot spots data 
mining system that is under development. The starting point is the many discov- 
eries that the current hot spots methodology identifies. An evolutionary approach 
is employed to evolve nuggets using a fitness measure that captures aspects of 
interestingness. Since “interestingness” is itself hard to capture we closely cou- 
ple with the nugget evolution a measure of interestingness that is itself evolved 
in concert with the domain user. We describe an architecture where a number 
of measures of interestingness compete to evolve alternative nugget sets from 
which the best nuggets are presented to the domain user for their ranking. This 
ranking is fed back into the system for further evolution of both the measure of 
interestingness and of the nuggets. 

2 The Search for Interesting Nuggets 

Padmanabhan and Tuzhilin (1998) demonstrate the need for a better grasp on 
the concept of interestingness for data mining with an example from marketing. 
Applying a traditional apriori association algorithm to the analysis of 87,437 
records of consumer purchase data, over 40,000 association rules were gener- 
ated, “many of which were irrelevant or obvious.” Identifying the important and 
actionable discoveries from amongst these 40,000 “nuggets” is itself a key task 
for data mining. 

The concept of interestingness is difficult to formalise and varies consider- 
ably across different domains. A growing literature in data mining is beginning 
to address the question. Early work attempted to identify objective measures of 
interestingness, and the confidence and support measures used in association al- 
gorithms are examples of objective measures. One of the earliest efforts to address 
the explosion of discoveries by identifying interestingness was through the use of 
rule templates with attribute hierarchies and visualisation (Klemettinen, Man- 
nila, Ronkainen, Toivonen and Verkamo 1994). Silbershatz and Tuzhilin (1996) 
partition interestingness measures into objective and subjective measures, and 
further partition subjective measures into those that capture unexpectedness 
and those that capture actionability. 

Many authors have focussed on capturing unexpectedness as a useful measure,
particularly in the context of discovering associations (Silbershatz and
Tuzhilin 1995) and classifications (Liu, Hsu and Chen 1997). Most recently
Padmanabhan and Tuzhilin (1998) develop an unexpectedness algorithm based on
logical contradiction in the context of expectations or beliefs formally expressed 
for an application domain. 

Capturing actionability is a difficult and less studied proposition. Matheus, 
Piatetsky-Shapiro and McNeill (1996) discuss the concept of payoff as a measure 
of interestingness, where they attempt to capture the expected payoff from the 
actions that follow from their discoveries (deviations). There is little other work
specifically addressing actionability. 



3 The Hot Spots Methodology 

We characterise our concept of interestingness in terms of attempting to iden- 
tify areas within very large, multi-dimensional, datasets which exhibit surprising 
(and perhaps unexpected) characteristics that may lead to some actions being 
taken to modify the business processes of the data owners. At this stage, rather 
than specifically formalising how interestingness can be expressed we are ex- 
ploring how we can facilitate the domain user in their search for interesting 
discoveries in their data using the hot spots methodology. 

We now introduce some terminology to describe the hot spots methodology,
following Williams and Huang (1997). A dataset $\mathcal{D}$ consists of a set of real world
entities (such as a set of policy holders in an insurance company or a set of
Medicare patients). Generally $\mathcal{D}$ is relational with only one universal relation
$R(A_1, A_2, \ldots, A_m)$ where the $A_i$ are the attributes of the entities. The dataset
consists of a set of entities: $\mathcal{D} = \{e_1, e_2, \ldots, e_n\}$, where each entity is a tuple
$\langle v_1, v_2, \ldots, v_m \rangle$ of values, one value for each attribute. For real world problems
the number of attributes $m$ and the number of tuples $n$ are typically “large” ($m$
may be anywhere from 20 to 1000 and $n$ typically greater than 1,000,000).

The hot spots methodology uses a data-driven approach to generate a set of
rules $\mathcal{R} = \{r_1, r_2, \ldots, r_p\}$, where each rule describes a group or set of entities
$g_i = \{e_j \mid r_i(e_j)\}$, $g_i \subseteq \mathcal{D}$. (The Boolean function $r_i(e_j)$ is true when entity $e_j$
is described by rule $r_i$.) We will find it convenient to refer to the set of groups
described by $\mathcal{R}$ as $\mathcal{G} = \{g_1, g_2, \ldots, g_p\}$ but regard $\mathcal{R}$ and $\mathcal{G}$ to be essentially
synonymous and call each element of $\mathcal{R}$ (or as the purpose suits, each element
of $\mathcal{G}$) a nugget. The set of nuggets is synonymously $\mathcal{M} = \{r_1, r_2, \ldots, r_p\}$ or
$\mathcal{M} = \{g_1, g_2, \ldots, g_p\}$. We note that $p$ is generally much smaller than $n$ but can
still be substantial (perhaps one or two thousand for $n$ in the millions). A rule
consists of a conjunction of conditions, each condition being either: $A_i \in [v_1, v_2]$
for numeric attributes or $A_i \in \{v_1, v_2, \ldots, v_q\}$ for categorical attributes. While
we have reduced the dimensionality of the problem (from $n$ down to $p$) for real
world applications $p$ generally remains too large for manual consideration.
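
The rule and group definitions above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `Interval` and `Membership`, the helper functions, and the four toy entities are assumptions introduced purely for illustration. A rule is a conjunction of conditions, and the group it describes is the set of entities for which the rule is true.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Sequence, Union

Entity = Dict[str, Union[float, str]]


@dataclass(frozen=True)
class Interval:
    """Numeric condition: attribute value lies in [low, high]."""
    attribute: str
    low: float
    high: float

    def holds(self, entity: Entity) -> bool:
        return self.low <= entity[self.attribute] <= self.high


@dataclass(frozen=True)
class Membership:
    """Categorical condition: attribute value is one of a fixed set."""
    attribute: str
    values: FrozenSet[str]

    def holds(self, entity: Entity) -> bool:
        return entity[self.attribute] in self.values


def rule(conditions: Sequence) -> Callable[[Entity], bool]:
    """Build r_i as a conjunction: true only when every condition holds."""
    return lambda entity: all(c.holds(entity) for c in conditions)


def group(r: Callable[[Entity], bool], dataset: List[Entity]) -> List[Entity]:
    """g_i = {e_j | r_i(e_j)}: the entities described by the rule."""
    return [e for e in dataset if r(e)]


# Toy dataset (invented values, for illustration only).
dataset = [
    {"age": 22, "vehicle": "Utility"},
    {"age": 45, "vehicle": "Sedan"},
    {"age": 19, "vehicle": "Station Wagon"},
    {"age": 23, "vehicle": "Sedan"},
]
young_utility = rule([
    Interval("age", 17, 24),
    Membership("vehicle", frozenset({"Utility", "Station Wagon"})),
])
nugget = group(young_utility, dataset)
```

Representing a rule as a Boolean function keeps the rule and the group it describes interchangeable, in line with the text's treatment of the two as essentially synonymous.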

We identify a hot spot as a set of entities which are of some particular in- 
terest to the domain user (e.g., loyal customer groups or regular high insurance 
claimers). Simple techniques such as clustering or segmentation can help with 
the task of identifying nuggets that are candidate hot spots, but are often
computationally expensive and/or build groups that are not well described. A heuristic
approach to this segmentation task that we have empirically found to be effec- 
tive in many real world problems involves the combination of clustering and rule 
induction, followed by an exploration of the discovered groups (Williams and 
Huang 1997), which we call the hot spots methodology. 

The hot spots methodology is a three step process: 

Step 1: Cluster $\mathcal{D}$ into $p$ complete and disjoint clusters $\mathcal{C} = \{C_1, C_2, \ldots, C_p\}$
where $\mathcal{D} = \bigcup_i C_i$ and $C_i \cap C_j = \emptyset$, $i \neq j$. We generally use a mixed data-type
k-means based clustering algorithm (Huang 1998).

Step 2: By associating with each record its cluster membership we use rule
induction to build discriminatory descriptions of each cluster, leading to the
rule set $\mathcal{R} = \{r_1, r_2, \ldots, r_q\}$. Usually $q > p$ and usually much greater (for each
cluster multiple rules may be induced). We will refer to a rule as a description of
a nugget (or simply as a nugget). Each nugget describes a subset of the original
dataset $\mathcal{D}$ and $r_i$ represents both the nugget description and the nugget subset.
Note that $r_i \cap r_j$ is not necessarily empty.

Step 3: The third step is to evaluate each nugget in the nugget set to find those
of particular interest. We define the function $Eval(r)$ as a mapping from nuggets
to a measure of the interestingness of nugget $r$. Such a function is domain dependent
and is the key to effectively mining the knowledge mine. The nuggets may
be evaluated in the context of all discovered nuggets or evaluated for their
actionability, unexpectedness, and validity in the context of the application domain.
This is the heart of the problem of interestingness.

An empirically effective approach to evaluating nuggets is based on building 
statistical summaries of the nugget subsets. Key variables that play an impor- 
tant role in the business problem at hand are characterised for each nugget and 
filters are developed to pick out those nuggets with profiles that are out of the 
ordinary. As the data mining exercise proceeds, the filters are refined and further 
developed. 

A visualisation of the summaries provides further and rapid insights to aid 
the identification of hot spots using a simple yet effective matrix-based graphic 
display of the data. This facilitates the task of working towards a small (man- 
ageable) collection of nuggets towards which further resources can be devoted. 

Domain users provide the most effective form of evaluation of discovered 
nuggets. Visualisation tools are also effective. However, as the nugget sets become 
large, such manual approaches become less effective. 



4 Hot Spots Applications 

We illustrate the hot spots methodology in the context of two case studies in- 
volving data from commercial collaborators. These relate to actual data mining 
exercises carried out on very large collections of data. While the actual results 
and data remain confidential we present indicative results in the following
sections.

4.1 Hot Spots for Insurance Premium Setting

NRMA Insurance Limited is one of Australia’s largest general insurers. A major 
task faced by any insurer is to ensure profitability, which, to oversimplify, requires 
that the total sum of premiums charged for insurance must be sufficient to cover 
all claims made against the policies, while keeping the premiums competitive. 
Our approach has been to identify, describe, and explore customer groups that 
have significant impact on the insurance portfolio — using the hot spots method- 
ology for risk assessment by better understanding and characterising customers. 
After preprocessing the dataset the three step hot spot methodology was used: 
clustering; rule induction; nugget evaluation. 

We present an example here consisting of a dataset of just some 72,000 records 
with 20 attributes, clustered into some 40 clusters, ranging in size from tens of 
records to thousands of records. Treating each cluster as a class we can build a 
decision tree to describe the clusters and prune the tree through rule generation. 
This leads to some 60 nuggets. An example is: 

No Claim Bonus < 60 and Address is Urban and
Age < 24 and Vehicle ∈ {Utility, Station Wagon}

An evaluation function was developed to identify interesting nuggets (groups 
of customers that exhibited some important characteristics in the context of 
the business problem). This began by deriving for each nugget a collection of 
indicators, such as the number and proportion of claims lodged by clients and 
the average and total cost of a claim for each nugget subset. 

This summary information is presented in Table 1 for some nuggets. Over the 
whole dataset (the final row of the table) there were 3800 claims, representing a 
proportion of some 5% of all clients (a typical figure). The overall average claim 
cost is $3000 with a total of some $12 million of claims. Particular values that 
are out of the ordinary in the context of the whole dataset are italicised and 
nuggets that are above a certain threshold for interestingness based on these 
values are highlighted. 

Our evaluation function identifies nuggets of reasonable size containing a high 
proportion of claims (greater than 10%) and having large average costs. This 
exploration and the refinement of the measure of interestingness is performed by 
the domain user. 
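
As a rough illustration of such an evaluation function, the sketch below filters nugget summaries on size, claim proportion and average cost. The figures for nuggets 1, 2, 3 and 6 and the overall average claim cost of $3000 are taken from Table 1; the minimum size of 100 and the reading of "large average cost" as "above the overall average" are assumptions, since the thresholds actually used are not stated.

```python
from typing import NamedTuple


class NuggetSummary(NamedTuple):
    nugget: str
    size: int            # number of clients described by the nugget
    claims: int          # number of claims lodged by those clients
    average_cost: float  # average cost of a claim, in dollars

    @property
    def proportion(self) -> float:
        """Claims as a percentage of the nugget size."""
        return 100.0 * self.claims / self.size


def interesting(n: NuggetSummary, overall_average_cost: float,
                min_size: int = 100, min_proportion: float = 10.0) -> bool:
    """Reasonable size, high claim proportion, and large average cost.

    min_size and the comparison against the overall average are
    assumptions made for this sketch.
    """
    return (n.size >= min_size
            and n.proportion > min_proportion
            and n.average_cost > overall_average_cost)


summaries = [
    NuggetSummary("1", 1400, 150, 3700),
    NuggetSummary("2", 2300, 140, 3800),
    NuggetSummary("3", 25, 5, 4400),
    NuggetSummary("6", 520, 65, 4400),
]
hot = [n.nugget for n in summaries if interesting(n, overall_average_cost=3000)]
```

Under these assumed thresholds nugget 3 is rejected for its small size even though a fifth of its clients claimed, which shows how sensitive the outcome is to the filters the domain user chooses.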



4.2 Hot Spots for Fraud Detection in Health 

The Australian Government’s public health care system, Medicare, is managed
by the Health Insurance Commission (HIC), which maintains one of the largest data
holdings worldwide, recording information relating to all payments to doctors
and patients made by Medicare since 1975. Like any large and complex payment
system, Medicare is open to fraud. The hot spots methodology has been used to
identify areas which may require investigation. 

Table 1. Summary motor vehicle insurance nugget data.

Nugget    Size  Claims  Proportion (%)  Average Cost ($)  Total Cost ($)
1         1400     150              11              3700         545,000
2         2300     140               6              3800         535,000
3           25       5              20              4400          13,000
4          120      10               8              7900          79,100
5          340      20               6              5300         116,000
6          520      65              13              4400         280,700
7            5       5             100              6800          20,300
60        1400     800             5.9              3500       2,800,000
All      72000    3800             5.0              3000      12,000,000

A subset of 40,000 of the many millions of patients is used for illustration
here. The data consists of over 30 raw attributes (e.g., age, sex, etc.) and some
20 derived attributes (e.g., number of times a patient visited a doctor over a
year, number of different doctors visited, etc.).

Nuggets were generated from clusters leading to over 280 nuggets. An exam- 
ple is: 

Age ∈ [18,25] and Weeks Claimed > 10 and
Hoarding Index > 15 and Benefit > $1000

Table 2 lists some nuggets, with cells of particular interest italicised and rows 
above a threshold for interestingness highlighted. 



Table 2. Summary Medicare nugget data.

Nugget    Size  Age  Gender  Services  Benefits  Weeks  Hoard  Regular
1         9000   30  F             10        30      2      1        1
2          150   30  F             24       841      4      2        4
3         1200   65  M              7       220     20      1        1
4           80   45  F             30       750     10      1        1
5           90   10  M             12      1125     10      5        2
6          800   55  M              8       550      7      1        9
280         30   25  F             15       450     15      2        6
All     40,000   45                 8        30      3      1        1

With 280 nuggets it becomes difficult to manually scan for those that are 
interesting. For larger Medicare datasets several thousand nuggets are identified. 
The evaluation function takes account of the average number of services, the 
average total benefit paid to patients, etc. 

The approach has successfully identified interesting groups in the data that 
were investigated and found to be fraudulent. A pattern of behaviour identified
by this process was claim hoarding where patients tended to collect together
many services and lodge a single claim. While not itself indicative of fraud a 
particularly regular subgroup was found to be fraudulent. 

5 Evolving Interesting Groups 

5.1 Background 

The hot spots methodology was found to be a useful starting point for an explo- 
ration for interesting areas within very large datasets. It provides some summary 
information and visualisations of that information. However, as the datasets be- 
come larger (typically we deal with many millions of entities) the number of 
nuggets becomes too large to manually explore in this way. The simple expression 
for interestingness based on comparisons of summary data to dataset averages 
is of limited use. 

To address this we propose an evolutionary approach that builds on the 
framework provided by the hot spots methodology. The aim is to allow domain 
users to explore nuggets and to allow the nuggets themselves to evolve according 
to some measure of interestingness. At a high level the two significant problems 
to be addressed are: 

1. How to construct nuggets? 

2. How to define the interestingness of a nugget? 

In a perfect world where we could define interestingness precisely there would 
be no problem. The definition would be used directly to identify relevant nuggets. 
The nature of the domains and problems we consider in data mining though is 
such that both the data and the goals are very dynamic and usually ill-defined. 
It is very much an exploratory process requiring the sophisticated interaction of 
domain users with the data to refine the goals. Tools which can better facilitate 
this sophisticated interaction are needed. 

We employ evolutionary ideas to evolve nuggets (described using rules) . Pre- 
vious work on classifier systems in Evolutionary Computation has also consid- 
ered the evolution of rules, and there is a limited literature on using evolution- 
ary ideas in data mining (Freitas 1997; Radcliffe and Surry 1994; Teller and 
Velosa 1995; Turney 1995; Venturini, Slimane, Morin and de Beauville 1997). 

The hot spots methodology begins to address both of the high level prob- 
lems identified above: constructing nuggets and measuring interestingness. For 
constructing nuggets a data driven approach is used: employing clustering and 
then rule induction. For measuring interestingness we define a measure based 
initially on the simple statistics of a group (e.g., a group is interesting if the 
average value of attribute A in the group is greater than 2 standard deviations 
from the average value of the attribute over the whole dataset). This may be 
augmented with tests on the size of the group (we generally don’t want too large 
groups as they tend to exhibit “expected” behaviour) and meta conditions that 
limit the number of hot spots to 10 or fewer. 
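
A minimal sketch of this simple measure follows, assuming that "greater than 2 standard deviations from the average" means an absolute deviation of the group mean from the dataset mean. The function names, the `max_group_size` parameter and the ordering of the results are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def deviation(group_values: Sequence[float], all_values: Sequence[float]) -> float:
    """Distance of the group mean from the dataset mean, in standard deviations."""
    sigma = pstdev(all_values)
    if sigma == 0:
        return 0.0
    return abs(mean(group_values) - mean(all_values)) / sigma


def hot_spots(groups: Dict[str, Sequence[float]], all_values: Sequence[float],
              max_group_size: int, threshold: float = 2.0,
              limit: int = 10) -> List[str]:
    """Names of at most `limit` groups passing the size and deviation tests."""
    scored = []
    for name, values in groups.items():
        if len(values) > max_group_size:
            continue  # very large groups tend to show "expected" behaviour
        d = deviation(values, all_values)
        if d > threshold:
            scored.append((d, name))
    scored.sort(reverse=True)  # most deviant groups first
    return [name for _, name in scored[:limit]]
```

The `limit` argument plays the role of the meta condition that caps the number of hot spots at 10 or fewer.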

Assessing true interestingness, relying on human resources to carry out actual
investigations, can be a time consuming task. As we proceed, working closely
with the domain user, our ideas of what is interesting amongst the nuggets being
discovered become refined and often more complex. This requires constant
refinement of the measure and further loops through the nugget construction
process to further explore the search space. We therefore develop an evolutionary
approach that tackles both the construction of nuggets and the measure of
interestingness.

5.2 Proposed Architecture 

We describe an evolutionary architecture to refine the set of nuggets $\mathcal{M}$ derived
through either a hot spots analysis, random generation, or in some other manner.
A small subset of $\mathcal{M}$ is presented to the domain user who provides insights into
their interestingness.

A measure of interestingness of a nugget is to be determined. The aim is
to develop an explicit function $I(g)$ (or $Eval(r)$) that captures this. An initial
random measure of interestingness will set the process going, or else we can
employ the simple measures used in the current hot spots methodology.

The measure of interestingness can then be used as a fitness measure in an 
evolutionary process to construct a collection of nuggets. By using an evolution- 
ary process we can explore more of the search space. Having evolved a fit pop- 
ulation of nuggets we present some small subset of these to the domain user for 
their evaluation (to express whether they believe these to be interesting nuggets).
We are working towards capturing the user’s view on the interestingness of the
nuggets and feeding this directly into the data mining process. An interesting
twist is that the population of measures of interestingness is also evolved, based 
on the user feedback. The top level architecture of the evolutionary hot spots 
data mining system is presented in Fig. 1. 



[Figure 1 omitted. Recoverable labels: Dataset (n = 1,000,000 entities, m attributes), Ruleset, Hot Spots.]

Fig. 1. The basic model for an evolutionary hot spots data miner with indicative sizes.



At any time in the process we will have some number, $q$, of measures of
interestingness: $\mathcal{I} = \{I_1, I_2, \ldots, I_q\}$. The rules in the ruleset $\mathcal{R}$ will be evolved
independently in each cycle through the process, leading to $q$ independent rule
sets, each being fit as measured by one of the measures of interestingness $I_k$.
Typically, a population of rules will consist of some thousand rules and it is not
practical to present all rules in a “fit” population to the user. Instead a small
subset of $s$ rules is chosen from each population. Thus, for each cycle through the
process, $q$ (typically 5) independent sets of $s$ (typically 3) rules will be presented
to the user for ranking. The user ranks all of these $q \times s$ rules in terms of their
assessment of the interestingness of the rules and the entities associated with the
rules. This ranking is then used to evolve the interestingness measures through
another generation to obtain a new population of $q$ measures.
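
One cycle of this process might be sketched as follows. This is a simplified illustration under several assumptions that are not in the paper: a rule is reduced to a tuple of summary features, a measure of interestingness is a weight vector over those features, and selecting the top-scoring rules from a shared pool stands in for the independent evolution of a rule population under each measure. How the user ranking is turned into a fitness for the measures is likewise a guess.

```python
import random
from typing import Dict, List, Sequence, Tuple

Rule = Tuple[float, ...]      # hypothetical: a rule reduced to summary features
Measure = Tuple[float, ...]   # hypothetical: one weight per feature


def score(measure: Measure, rule: Rule) -> float:
    """Interestingness of a rule under one measure (a weighted sum here)."""
    return sum(w * x for w, x in zip(measure, rule))


def present(measures: Sequence[Measure], rules: Sequence[Rule],
            s: int = 3) -> List[List[Rule]]:
    """For each measure I_k, pick the s rules it scores highest."""
    return [sorted(rules, key=lambda r: score(m, r), reverse=True)[:s]
            for m in measures]


def evolve_measures(measures: Sequence[Measure], shown: Sequence[Sequence[Rule]],
                    user_rank: Dict[Rule, int], rng: random.Random,
                    sigma: float = 0.1) -> List[Measure]:
    """Keep the measures whose rules the user ranked best; mutate to refill.

    user_rank maps each presented rule to its position (0 = most interesting).
    """
    def cost(k: int) -> int:
        return sum(user_rank[r] for r in shown[k])

    order = sorted(range(len(measures)), key=cost)
    survivors = [measures[k] for k in order[:max(1, len(measures) // 2)]]
    children: List[Measure] = []
    while len(survivors) + len(children) < len(measures):
        parent = rng.choice(survivors)
        children.append(tuple(w + rng.gauss(0, sigma) for w in parent))
    return survivors + children
```

With $q = 5$ measures and $s = 3$ rules per measure, `present` yields the 15 rules the user is asked to rank, and `evolve_measures` returns the next population of $q$ measures.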

The analysis of the ranking might, for example, find in the top rules some
commonality in conditions. These could be identified as atoms to genetically
engineer the next generation of the rule set $\mathcal{R}$ (retaining these atoms in any
rules that are modified). Alternatively we increase the fitness of other rules that
contain the commonality. Indeed, under this scheme we can identify conditions
that either increase or decrease the fitness of other rules in $\mathcal{R}$. A tuning parameter
is used to indicate the degree to which fitness can be modified. The evolutionary
process then takes over to generate the next population $\mathcal{R}'$ using these new
fitness measures.
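
The fitness adjustment could look something like the sketch below, in which a rule is modelled as a set of condition strings. The helper names, the `min_share` cut-off for deciding that a condition is "common", and the multiplicative form of the adjustment are all assumptions; the text specifies only that a tuning parameter bounds how far fitness can be modified.

```python
from collections import Counter
from typing import AbstractSet, FrozenSet, Sequence, Set


def common_atoms(top_rules: Sequence[FrozenSet[str]],
                 min_share: float = 0.5) -> Set[str]:
    """Conditions appearing in at least min_share of the user's top rules."""
    counts = Counter(atom for rule in top_rules for atom in rule)
    return {a for a, c in counts.items() if c / len(top_rules) >= min_share}


def adjusted_fitness(rule: AbstractSet[str], base_fitness: float,
                     favoured: AbstractSet[str],
                     disfavoured: AbstractSet[str] = frozenset(),
                     tuning: float = 0.2) -> float:
    """Raise or lower fitness per favoured or disfavoured condition in the rule.

    `tuning` is the degree to which each such condition can modify fitness.
    """
    bonus = len(rule & favoured)
    penalty = len(rule & disfavoured)
    return base_fitness * (1 + tuning * (bonus - penalty))
```

Setting `tuning` to zero recovers the unmodified fitness, so the parameter directly controls how strongly the user's ranking steers the next generation.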

At each stage the domain users are involved in the process, and as inter- 
esting discoveries are brought to their attention, they assess whether further 
investigation is required. 



6 Summary 

We present an architecture that is being implemented and tested in actual data 
mining exercises. At this stage we can not make any claims about its usefulness 
although early feedback indicates that it does provide a useful expansion of the 
search space under the control of a domain user, allowing a focus on relevant dis- 
coveries to be established. The expression of the formulae capturing interesting- 
ness still requires considerable research. We expect to adopt various approaches 
to the actual expression of interestingness as developed by our experience and 
that of others. With a formal and well developed language the evolution of the 
measures of interestingness will be better able to capture the user’s insights, as 
well as providing some direction setting suggestions for exploration. 

Acknowledgements 

The author acknowledges the support provided by the Cooperative Research 
Centre for Advanced Computational Systems (ACSys) established under the 
Australian Government’s Cooperative Research Centres Program. Dr. Xin Yao 
(Australian Defence Forces Academy) has contributed tremendously in discus- 
sion of the evolutionary aspects and Reehaz Soobhany (Australian National Uni- 
versity) has implemented much of the rule evolution code. 



Evolutionary Hot Spots Data Mining 193 



References 

Freitas, A. A.: 1997, A genetic programming framework for two data mining tasks: 
classification and generalized rule induction, in J. R. Koza, K. Deb, M. Dorigo, 
D. B. Fogel, M. Garzon, H. Iba and R. L. Riolo (eds). Proceedings of the Second 
Annual Conference on Genetic Programming, Morgan Kaufmann, San Francisco, 
CA, pp. 96-101. 

Huang, Z.: 1998, Extensions to the k-means algorithm for clustering large data sets 
with categorical values. Data Mining and Knowledge Discovery 2(3), 283-304. 

Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H. and Verkamo, A. I.: 1994,
Finding interesting rules from large sets of discovered association rules, in N. R. 
Adam, B. K. Bhargava and Y. Yesha (eds). Proceedings of the Third International 
Conference on Information and Knowledge Management, ACM Press, pp. 401-407. 

Liu, B., Hsu, W. and Chen, S.: 1997, Using general impressions to analyse discovered 
classification rules, KDD97: Proceedings of the Third International Conference on 
Knowledge Discovery and Data Mining, AAAI Press, pp. 31-36. 

Matheus, C. J., Piatetsky-Shapiro, G. and McNeill, D.: 1996, Selecting and reporting 
what is interesting, in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthu- 
rusamy (eds). Advances in Knowledge Discovery and Data Mining, AAAI Press, 
pp. 465-515. 

Padmanabhan, B. and Tuzhilin, A.: 1998, A belief-driven method for discovering un- 
expected patterns, KDD98: Proceedings of the Fourth International Conference on 
Knowledge Discovery and Data Mining, AAAI Press. 

Radcliffe, N. J. and Surry, P. D.: 1994, Cooperation through hierarchical competition in
genetic data mining. Technical Report EPCC-TR94-09, University of Edinburgh, 
Edinburgh Parallel Computing Centre, King’s Buildings, University of Edinburgh, 
Scotland, EH9 3JZ. 

Silbershatz, A. and Tuzhilin, A.: 1995, On subjective measures of interestingness in 
knowledge discovery, in U. M. Fayyad and R. Uthurusamy (eds), KDD95: Pro- 
ceedings of the First International Conference on Knowledge Discovery and Data 
Mining, AAAI Press, pp. 275-281. 

Silbershatz, A. and Tuzhilin, A.: 1996, What makes patterns interesting in knowl- 
edge discovery systems, IEEE Transactions on Knowledge and Data Engineering 
8(6), 970-974. 

Teller, A. and Velosa, M.: 1995, Program evolution for data mining. The International 
Journal of Expert Systems 8(3), 216-236. 

Turney, P. D.: 1995, Cost-sensitive classification: Empirical evaluation of a hybrid ge- 
netic decision tree induction algorithm. Journal of Artificial Intelligence Research 
2, 369-409. 

Venturini, G., Slimane, M., Morin, F. and de Beauville, J.-P. A.: 1997, On using
interactive genetic algorithms for knowledge discovery in databases, in T. Bäck (ed.),
Proceedings of the Seventh International Conference on Genetic Algorithms,
Morgan Kaufmann, pp. 696-703.

Williams, G. J. and Huang, Z.: 1997, Mining the knowledge mine: The Hot Spots 
methodology for mining large, real world databases, in A. Sattar (ed.). Advanced 
Topics in Artificial Intelligence, Vol. 1342 of Lecture Notes in Computer Science, 
Springer-Verlag, pp. 340-348.



Efficient Search of Reliable Exceptions 



Huan Liu^1, Hongjun Lu^2*, Ling Feng^3, and Farhad Hussain^1

^1 School of Computing, National University of Singapore
Singapore, 117599
{liuh, farhad}@comp.nus.edu.sg
^2 Department of Computer Science
Hong Kong University of Science and Technology
^3 Department of Computing, Hong Kong Polytechnic University
luhj@cs.ust.hk
cslfeng@comp.polyu.edu.hk



Abstract. Finding patterns from data sets is a fundamental task of 
data mining. If we categorize all patterns into strong, weak, and random, 
conventional data mining techniques are designed only to find strong 
patterns, which hold for numerous objects and are usually consistent 
with the expectations of experts. While such strong patterns are helpful 
in prediction, the unexpectedness and contradiction exhibited by weak 
patterns are also very useful although they represent a relatively small 
number of objects. In this paper, we address the problem of finding 
weak patterns (i.e., reliable exceptions) from databases. A simple and 
efficient approach is proposed which uses deviation analysis to identify 
interesting exceptions and explore reliable ones. Besides, it is flexible in 
handling both subjective and objective exceptions. We demonstrate the 
effectiveness of the proposed approach through a set of real-life data sets, 
and present interesting findings. 



1 Introduction 

Data mining has attracted much attention from practitioners and researchers in 
recent years. Combining techniques from the fields of machine learning, statistics 
and database, data mining works towards finding patterns from huge databases 
and using them for improved decision making. If we categorize patterns into 

— strong patterns - regularities for numerous objects; 

— weak patterns - reliable exceptions representing a relatively small number of 
objects; and 

— random patterns - random and unreliable exceptions,

conventional data mining techniques are only designed to find the strong
patterns which have high predictive accuracy or correlation. This is because we 
normally want to find such kinds of patterns that can help the prediction task. 
However, in certain tasks of data mining, we may seek more than predicting: as
strong patterns are usually consistent with the expectations of experts, we want
to know what is in the data that we do not know yet. Therefore, in some cases, 
we are more interested in finding out those weak patterns outside the strong 
ones. Usually, such patterns (reliable exceptions) are unknown, unexpected, or 
contradictory to what the user believes. Hence, they are novel and potentially 
more interesting than strong patterns to the user. For example, if we are told that 
“some kind of jobless applicants are granted credit” , that will be more novel and 
interesting, as compared to “jobless applicants are not granted credit” . Moreover, 
an exception rule is often beneficial since it differs from a common sense rule 
which is often a basis for people’s daily activity [16]. 



1.1 Insufficient Support from Current Data Mining Techniques 

Most current data mining techniques cannot effectively support weak pattern
mining. Taking association rule mining as an example, all association rules
$X \Rightarrow Y$ with support $s$ and confidence $c$ in a database are found in two phases.
In the first expensive phase, the database is searched for all frequent itemsets,
i.e., sets of items that occur together in at least $s\%$ of instances (records) in the
database. A fast level-wise algorithm Apriori [1,2] was developed to address the
nature of this problem. It derives the frequent $k$-itemsets from frequent $(k-1)$-itemsets.
Apriori constructs a candidate set $C_k$ from frequent itemset $L_{k-1}$,
counts the number of occurrences of each candidate itemset, and then determines
the set $L_k$ of frequent itemsets based on pre-determined support $s$.

With the nice downward closure property (i.e., if a (k + 1)-itemset has support
s, all its k-subsets also have support s), Apriori can efficiently cross out
some (k + 1)-itemsets which cannot have support s, and only check the remaining
(k + 1)-itemsets. For instance, suppose Sup({jobless, not_granted}) = .70 and
Sup({jobless, granted}) = .30. By setting the support threshold to .50, the strong
frequent itemset {jobless, not_granted} is kept while the latter is pruned right
away, and no further itemset like {jobless, female, granted} will be explored.
The search space is thus reduced greatly.
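The pruning step described above can be sketched in a few lines. This is only an illustration under stated assumptions: itemsets are represented as Python frozensets, and the function name `prune_candidates` and the toy frequent 2-itemsets are hypothetical, not taken from the Apriori papers [1,2].

```python
from itertools import combinations


def prune_candidates(candidates, frequent_prev):
    """Keep only candidate (k+1)-itemsets whose k-subsets are all frequent.

    candidates    -- iterable of frozensets, each of size k+1
    frequent_prev -- set of frozensets, the frequent k-itemsets
    """
    kept = []
    for cand in candidates:
        k = len(cand) - 1
        # Downward closure: a single infrequent k-subset rules the candidate out.
        if all(frozenset(sub) in frequent_prev for sub in combinations(cand, k)):
            kept.append(cand)
    return kept


# Hypothetical frequent 2-itemsets: {jobless, granted} fell below the .50
# threshold, so it is absent from this set.
frequent_2 = {
    frozenset({"jobless", "not_granted"}),
    frozenset({"jobless", "female"}),
    frozenset({"female", "not_granted"}),
    frozenset({"female", "granted"}),
}
candidates_3 = [
    frozenset({"jobless", "female", "granted"}),
    frozenset({"jobless", "female", "not_granted"}),
]
survivors = prune_candidates(candidates_3, frequent_2)
```

With these toy inputs, {jobless, female, granted} is discarded without ever being counted, because its subset {jobless, granted} is not frequent. This is exactly why a weak pattern built on that subset never reaches the rule generation phase.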

After obtaining all frequent itemsets, the second phase will generate association
rules from each frequent itemset so that more than c% of records that contain
X also contain Y. Since relatively infrequent itemsets like {jobless, female,
granted} are omitted, some reliable exception rules such as
jobless ∧ female → granted cannot be discovered although they show high confidence.



1.2 Intuitive Extensions Cannot Help Either 

There are some intuitive ways we can think of to generate these weak patterns
using traditional data mining techniques. To find exception association rules, for
example, we can introduce two support thresholds $[s_1, s_2]$ as delimiters instead of
only one $[s_1, 100\%]$, and search for those itemsets whose support values fall into
this range $[s_1, s_2]$. However, in such a case, the efficient pruning property,
downward closure, no longer holds, which may lead to intolerable mining performance.

Another straightforward method is generate-and-remove: for a data set D,
select the favorite induction algorithm, say C4.5 [12]; generate rules R on D;
remove from D the instances D' that are covered correctly by R; then generate
rules R' on D − D'; repeat the process until no data is left. We can be sure that
this procedure can find us many weak patterns. However, it is obvious that these
weak patterns are induced independently of the strong patterns. An immediate
problem is that there could be too many weak rules. Even though there are many,
some interesting exceptions may be missed due to the bias of the chosen induction
algorithm.

Generating all possible rules $R_{all}$ is also a possible approach. It is certain
that $R_{all}$ will contain any interesting weak rules. However, there could be too
many rules: their number could be much larger than the number of instances in the data.

1.3 Related Work 

Generally, there are two schools in finding interesting and unexpected patterns: 
subjective vs. objective. The major objective factors used to measure the in- 
terestingness of discovered patterns include strength, confidence, coverage and 
simplicity [10,3,13]. 

Not all strong rules with high confidence and support are interesting, as they
may correspond to some prior knowledge or expectations. The importance of
prior knowledge cannot be over-stressed [4]. [14,9] propose two subjective
measurements, unexpectedness and actionability, and define the unexpected measure
of interestingness in terms of a belief system. [5] uses rule templates to find
interesting rules from the set of all discovered association rules. [6] analyzes the
discovered classification rules against a set of general impressions that are
specified using a representation language. Only those patterns conforming to these
impressions are regarded as unexpected. The issue of interesting deviations is
also discussed in [11].

As the interestingness of a pattern itself is subjective, a rule could be
interesting to one user but not interesting to another. Thus, most of the previous
work relies on users to distinguish those reliable exception patterns based
on the existing concepts. One potential problem is that users’ subjective
judgments may be unreliable and uncertain when the discovered rules are
numerous. Recently, [16] proposed an autonomous probabilistic estimation
approach which can discover all rule pairs (i.e., an exception rule associated with a
common sense rule) with high confidence. Neither users’ evaluation nor domain
knowledge is required.

1.4 Our Work 

In this paper, we address the problem of finding weak patterns, i.e., reliable
exceptions, from datasets. A simple yet flexible approach is proposed which works
around the strong patterns and uses deviation analysis to find reliable exceptions.
Different from previous work [16], we shun searching for strong common sense
patterns; instead, we directly identify those exceptional instances and mine weak
patterns from them. Besides the promised efficiency for weak pattern mining,
the proposed method can also handle both subjective and objective exceptions.
To demonstrate the effectiveness of this novel method, we apply it to a set of
real-life data sets [8], and indeed find some interesting exception patterns.
From the mushroom data set, we can obtain more reliable exceptions, compared
to the results from [16]. The remainder of the paper is organized as follows:
Section 2 gives a detailed description of the proposed approach. Experiments are
conducted and some interesting exceptions are presented in Section 3. Section 4
concludes this work.



2 A Focused Approach 



Our approach is based on the following observations: (1) Any exception would
have a low support [2] found in the data, otherwise it might be a strong
pattern. (2) A reasonable induction algorithm can summarize data and learn rules.
(3) Attributes in the rules are salient features. Observation (1) suggests that
exceptions cannot be discovered from the data by applying standard machine
learning techniques. Observations (2) and (3) allow us to focus on the important
features, so that an efficient method for finding reliable exceptions becomes
possible. Our approach consists of the four phases below:

1. Rule induction and focusing. 

This phase obtains the strong patterns. Normally, a user can stop here for
preliminary data mining probing. If there are too many rules, the user
can choose to focus on the strongest rules or follow the suggestions in [6]. Let’s
assume that a few rules have caught our attention and we are curious to know
any reliable exceptions with respect to these strong patterns. This is a filtering
step that helps us focus on a few attributes quickly. If we are confident about
what we want to investigate, i.e., we know the relevant attributes, then this step
can be replaced by user specification of relevant attributes.

2. Contingency table and deviation. 

Once we focus on some attribute in a rule, we can use this attribute and the
class attribute to build a two-way contingency table that allows us to calculate
deviations [15] (see footnote 1).

Class      Attribute                                            R-Total
           V_1           V_2           ...   V_c
C_1        (n_11) x_11   (n_12) x_12   ...   (n_1c) x_1c        n_1.
C_2        (n_21) x_21   (n_22) x_22   ...   (n_2c) x_2c        n_2.
...        ...           ...           ...   ...                ...
C_r        (n_r1) x_r1   (n_r2) x_r2   ...   (n_rc) x_rc        n_r.
C-Total    n_.1          n_.2          ...   n_.c               n

Footnote 1: For more than one attribute, we need to resort to multi-way
contingency tables.

In the table, $x_{ij}$ are the frequencies of occurrence found in the data,
$n_{ij} = n_{i\cdot} n_{\cdot j} / n$ are the expected frequencies of occurrence,
$n = \sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij}$ is the grand total,
$n_{\cdot j} = \sum_{i=1}^{r} x_{ij}$ is a column total (C-Total), and
$n_{i\cdot} = \sum_{j=1}^{c} x_{ij}$ is a row total (R-Total). Using the expected
frequency as the norm, we can define the deviation as

$$\delta_{ij} = \frac{x_{ij} - n_{ij}}{n_{ij}}$$

3. Positive, negative, and outstanding deviations. 

Using the above definition to calculate deviations, we can expect to have three
types: positive, zero, or negative. If the deviation is positive, it suggests that what
is concerned is consistent with the strong patterns; if it is zero, it is the norm; and
if it is negative, what is concerned is inconsistent with the strong patterns. The
value of $\delta$ displays the magnitude of deviation. A large value means the deviation
could be caused by chance. Since we are concerned about reliable exceptions and
reliability is subject to the user’s need, we specify a threshold $\delta_t$ to define how
reliable is reliable. If $\delta_t > 0$, any deviation $\delta > \delta_t$ is positively
outstanding; if $\delta_t < 0$, any deviation $\delta < \delta_t$ is negatively outstanding.
Deviations are powerful and useful in this case as they provide a simple way of
identifying interesting patterns in the data. This has been shown in the KEFIR [7]
application.

4. Reliable exceptions. 

After we have identified outstanding negative deviations of attribute-values
with respect to the class label, we can get all the instances that contain these
attribute-value and class pairs, and perform further mining on the focused
dataset, called a window, using any data mining techniques we prefer. For example,
we can continue searching for frequent itemsets to investigate the above exceptional
combinations. As the number of instances is now much smaller than in the original
data, the mining performance should improve greatly. Reliable exceptions can be
those longest frequent itemsets with high support in the window.

A strong association rule found in the window may itself be a strong rule that
can be found in the whole data set. We need to make sure what we find are weak
patterns: low support but high confidence. In other words, any sub-itemsets
found in a reliable exception should be excluded if they hold high support in the
whole data. A simple method is as follows: assuming that X is a sub-itemset that
does not include the negatively deviated attribute in a strong association rule
found in the window, we can compare the support $sup_{win}$ of X in the window
and its counterpart $sup_{who}$ in the whole data. Note that what we really want
to check is P(X, c) for the window and P(X) for the whole data with respect
to X → c. If we consider their ratio, they are actually the confidence values.
Therefore, if the difference $conf_{win} - conf_{who}$ is sufficiently large (as
$conf_{win}$ is always 1, a large difference means a low $conf_{who}$), we should be
satisfied that X's high confidence is unique to the window; otherwise, X does not
have sufficient evidence to be included in the final reliable exception.
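Phases 2 to 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-based record format, and the 2 × 2 counts used at the end are all hypothetical, and the sketch assumes every row and column total is non-zero so that the expected counts can be used as divisors.

```python
def expected_counts(observed):
    """Expected cell counts n_ij = n_i. * n_.j / n under independence."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def deviations(observed):
    """Relative deviations delta_ij = (x_ij - n_ij) / n_ij."""
    expected = expected_counts(observed)
    return [
        [(x - e) / e for x, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]


def negative_outstanding(observed, threshold):
    """Cells (i, j) whose deviation falls below a negative threshold."""
    dev = deviations(observed)
    return [
        (i, j)
        for i, row in enumerate(dev)
        for j, d in enumerate(row)
        if d < threshold
    ]


def window(records, attribute, value, class_attribute, class_value):
    """Instances that carry the deviating attribute-value and class pair."""
    return [
        rec for rec in records
        if rec[attribute] == value and rec[class_attribute] == class_value
    ]


# Hypothetical counts: rows are classes, columns are attribute values.
observed = [[30, 10], [20, 40]]
cells = negative_outstanding(observed, threshold=-0.4)
```

For the hypothetical counts, the expected table is [[20, 20], [30, 30]], so only cell (0, 1) with deviation −0.5 falls below the threshold of −0.4. The instances returned by `window` for that cell would then be handed to any conventional miner.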

From exception rules to common sense and reference rules 

Before we verify the proposed approach, we would like to say a few words
about the exceptions found by our approach and those defined in [16]. Briefly,
Suzuki suggested that for an exception rule, we should also be able to find its
corresponding common sense rule and reference rule. A reference rule should hold
low support and low confidence. Due to the nature of negative deviations, it is
obvious that every exception rule found by our approach has a corresponding
common sense rule (with positive deviation). A reference rule can also be found
in our effort as we remove high confidence sub-itemsets mentioned in the previous
paragraph. In summary, common sense and reference rules can be obtained based
on exception rules.



3 Searching for Exceptions from Data Sets 

In this section, we give detailed accounts of applying our approach to benchmark
data sets [8].



3.1 Experiment 1 - The Credit Data Set 

The first data set used is the credit database which includes 10 attributes and 125 
instances, recording situations where people are or aren’t granted credit. 

Step 1. Identifying relevant attributes 

We ran C4.5rules on the data and obtained 5 rules in which the condition 
part of 3 rules contains attribute “Jobless”. We thus consider it as relevant, and 
the class attribute is “Credit” . As we can see, this step could be objective - the 
relevant attributes are obtained from an algorithm, or subjective - the relevant 
attributes are specified by an expert or user. 

Step 2. Building contingency tables 

Table 1 is a contingency table about “Jobless” and “Credit”. For continuous
attributes, the different range groups are drawn directly from the discretized
results generated by the standard equal width method. The objective of the
table is to determine whether the two directions of the pairings of “Jobless” and
“Credit” are independent (as in the $\chi^2$ test). Each cell of the table represents
one of the k = 2 × 2 = 4 categories of a two-directional classification of the n = 125
observations. The row totals are $n_{1\cdot} = 40$ for “Credit = No” and
$n_{2\cdot} = 85$ for “Credit = Yes”, while the column totals are $n_{\cdot 1} = 111$
for “Jobless = No” and $n_{\cdot 2} = 14$ for “Jobless = Yes”. For example, $x_{12}$
represents the observed count of applicants that are jobless and not granted
credit, while $n_{12}$ represents the corresponding expected count.

In contingency table analysis, if the two attributes are independent, the
probability that an item is classified in any particular cell is given by the product of
the corresponding marginal probabilities. This is because in a statistical sense,
we know that independence of events A and B implies P(A ∧ B) = P(A)P(B). It
can be shown that the null hypothesis that the directions of classification are
independent (in Table 1) is equivalent to the hypothesis that every cell probability
is equal to the product of its respective row and column marginal probabilities.

Table 1. (Expected) and observed cell counts for the “jobless” classification

Class (grant)   Attribute = Jobless         R-Total
                No            Yes
No              (35.5) 28     (4.5) 12      40
Yes             (75.5) 83     (9.5) 2       85
C-Total         111           14            125



Step 3. Identifying outstanding negative deviations 

In Table 1, under the assumption of independence, we can compute the deviation
$\delta_{ij}$ of the observed counts from the expected counts as
$\delta_{ij} = (x_{ij} - n_{ij}) / n_{ij}$. Table 2 shows the deviations, sorted by
their magnitudes.

Table 2. Deviation analysis of the jobless classification.

Jobless   Class   x_ij   n_ij    δ
Yes       No      12     4.5     * +1.68
Yes       Yes     2      9.5     * −.79
No        No      28     35.5    −.21
No        Yes     83     75.5    +.10



If the default threshold is $\delta_t = -.30$, we have two outstanding deviations,
denoted by “*” in Table 2. The large positive deviation is consistent with the
trend suggested by the “norm”. It is the large negative deviation that can
lead to a reliable exception. The pattern “a jobless applicant is not given a
credit grant” exhibits the most significant positive deviation at $\delta = +1.68$.
Such patterns are always statistically significant, intuitive and represented by
strong patterns. For example, the association below has a very high confidence.

Jobless = “Yes” → class = “No” [confidence = .857]

The outstanding negative deviation ($\delta = -.79$) says that “Jobless” applicants
are given “Credit”. This is intuitively wrong, or unexpected, and naturally
worthy of probing.

Step 4. Finding reliable exceptions 

Following the findings suggested by the outstanding negative deviation, we
gather the instances that satisfy Jobless = “Yes” and Credit = “Yes”, or “the
jobless being granted a credit”. The satisfying instances form a window; the
number of instances defines the window size. Applying Apriori, for this application,
we obtain the weak pattern “A female jobless with little working experience is
given credit” after removing those items with high overall confidence (> 0.70)
such as Area = “Good” (0.72) and Married = “Yes” (0.71).

The number of exception candidates is determined by $\delta_t$: the higher the
threshold, the fewer the candidates. Although there may not be exceptions for
every individual attribute, the more attributes, the more likely that we can find
more exceptions.

3.2 Experiment 2 - The Mushroom Data Set 

This data set is obtained from [8]. It has 22 attributes, each has 2 to 12 pos- 
sible values, and 8124 instances with binary class. The two types (classes) of 
mushrooms are poisonous or edible. For this data set, we want to verify if our 
approach can find the exceptions found in [16]. 

Step 1. Identifying relevant attributes 

Since all the four rules in [16] contain Stalk-root = ?, we select Stalk-root 
and the class for the experiment. 

Step 2. Building contingency tables 

This step is routine and results are shown in the table below. We only include 
expected frequencies for “Stalk-root = ?”. 

Contingency table for the “Stalk-root” classification.

Type      Attribute = Stalk-root                            R-Total
          b       c      e       r      ?
p         1856    44     256     0      (1195) 1760         3916
e         1920    512    864     192    (1285) 720          4208
C-Total   3776    556    1120    192    2480                8124



Step 3. Identifying outstanding negative deviations 

In the table below, we show the deviations for “Stalk-root = ?” and omit 
others. The outstanding negative deviation is -.44 for “Type = e”. 

Deviation analysis of the “Stalk-root = ?” classification.

Stalk-root   Type   x_ij   n_ij   δ
?            p      1760   1195   +.47
?            e      720    1285   −.44













Step 4. Finding reliable exceptions 

Using “Stalk-root = ?” and “Type = e”, we find a window with 720 instances. 
If we simply run Apriori on the window to find frequent itemsets, we obtain 
two itemsets (recall that confidence for the window is always 1). For exception 
candidate number 1, we have the following itemset. “Chosen” below indicates 
whether an item is chosen to remain in the itemset. 



Attribute-Value     Overall Conf   Chosen
stalk-root = ?      .29            Y
veil-type = p       .52            Y
stalk-shape = e     .46            Y
gill-size = b       .70            Y
bruises = f         .31            Y
ring-type = p       .79            N

After removing those with high overall confidence (> .70), we have rule 1 as
found in [16] repeated here.

CS: bruises=f, g-size=b, stalk-shape=e → class=p
RE: C, stalk-root=? → class=e
RR: stalk-root=? → class=e

where CS is a common sense rule, RE is a reliable exception, C in RE is the
conditional part of CS, and RR is a reference rule.
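The item-removal step for exception candidate 1 can be reproduced with a short sketch. The overall confidence values are copied from the table above; the function name `filter_items` is hypothetical, and the strict comparison (> .70) is an assumption inferred from the fact that gill-size = b, at exactly .70, is marked as chosen.

```python
def filter_items(overall_conf, threshold=0.70):
    """Split items into kept and removed according to overall confidence.

    Items whose confidence in the whole data exceeds the threshold are
    removed, since their high confidence is not unique to the window.
    """
    kept = [item for item, conf in overall_conf.items() if conf <= threshold]
    removed = [item for item, conf in overall_conf.items() if conf > threshold]
    return kept, removed


# Overall confidence values for exception candidate 1.
candidate_1 = {
    "stalk-root = ?": 0.29,
    "veil-type = p": 0.52,
    "stalk-shape = e": 0.46,
    "gill-size = b": 0.70,
    "bruises = f": 0.31,
    "ring-type = p": 0.79,
}
kept, removed = filter_items(candidate_1)
```

Only ring-type = p is removed, which agrees with the “Chosen” column.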

For exception candidate number 2, after removing items with high overall
confidence, we obtain a super-itemset for exception rules 2, 3, and 4 in [16] as
follows. We repeat the three rules below.

Attribute-Value           Overall Conf   Chosen
stalk-root = ?            .29            Y
spore-print-color = w     .24            Y
veil-color = w            .51            Y
veil-type = p             .52            Y
stalk-shape = e           .46            Y
gill-size = b             .70            Y
gill-attachment = f       .51            Y
ring-number = t           .88            N
odor = n                  .97            N

#2  CS: g-attachment=f, stalk-root=? → class=p
    RE: C, g-size=b, stalk-shape=e, veil-color=w → class=e
    RR: g-size=b, stalk-shape=e, veil-color=w → class=e

#3  CS: stalk-root=?, sp-color=w → class=p
    RE: C, g-attachment=f, g-size=b, stalk-shape=e → class=e
    RR: g-attachment=f, g-size=b, stalk-shape=e → class=e

#4  CS: stalk-root=?, sp-color=w → class=p
    RE: C, g-size=b, stalk-shape=e, veil-color=w → class=e
    RR: stalk-root=? → class=e



4 Conclusion 

We propose here a simple approach that enables us to study reliable exceptions 
with respect to a rule of our interest or attributes specified by a user. The 
major techniques are deviation analysis, windowing, and conventional mining 
tools (e.g., Apriori for association rule mining). For the concerned attributes, we 
first find their negative deviations which determine the window, and then search 
reliable patterns from the window using any data mining tools we want. Reliable 
exceptions are those patterns that are only valid in the window. 

This approach is efficient because (1) it works around focused attributes,
thus avoiding searching all attributes in the database; (2) we are only concerned
about negative deviations; (3) it only scans the data once to create the window;
and (4) the window size (i.e., number of instances) is usually much smaller than
the number of instances in the data. Besides, such an approach is also flexible
enough to handle both subjective and objective prior knowledge. Weak pattern
finding is an important area for effective, actionable, and focused data mining.

Abstract. We describe heuristics, based upon information theory and
statistics, for ranking the interestingness of summaries generated from 
databases. The tuples in a summary are unique, and therefore, can be 
considered to be a population described by some probability distribu- 
tion. The four interestingness measures presented here are based upon 
common measures of diversity of a population: variance, the Simpson 
index, and the Shannon index. Using each of the proposed measures, we 
assign a single real value to a summary that describes its interesting- 
ness. Our experimental results show that the ranks assigned by the four 
interestingness measures are highly correlated. 



1 Introduction 

Knowledge discovery from databases (KDD) is the nontrivial process of identify- 
ing valid, previously unknown, potentially useful patterns in data [3,4]. However, 
the volume of data contained in a database often exceeds the ability to analyze 
it efficiently, resulting in a gap between the collection of data and its under- 
standing [4]. A number of successful algorithms for KDD have previously been 
developed. One particular summarization algorithm, attribute-oriented
generalization (AOG), has been shown to be among the most efficient of methods for
KDD [1]. AOG summarizes the data in a database by replacing specific attribute
values with more general concepts according to user-defined concept hierarchies (CHs).

Until recently, AOG methods were limited in their ability to efficiently
generate summaries when multiple CHs were associated with an attribute. To
resolve this problem, we previously introduced new serial and parallel AOG
algorithms [8] and a data structure called a domain generalization graph
(DGG) [5,8,10]. A DGG defines a partial order representing the set of all possible
summaries that can be generated from a set of attributes and their associated
CHs. However, when the number of attributes to be generalized is large, or
the DGG associated with a set of attributes is complex, many summaries may
be generated, requiring the user to manually evaluate each one to determine
whether it contains an interesting result. In this paper, we describe heuristics,
based upon information theory and statistics, for ranking the interestingness of
summaries generated from databases. Our objective is to assign a single real
value to a summary that describes its interestingness.

Although our measures were developed and utilized for ranking the inter- 
estingness of generalized relations, they are more generally applicable to other 
problem domains, such as ranking views (i.e., precomputed, virtual tables de- 
rived from a relational database) or summary tables (i.e., materialized, aggregate 
views derived from a data cube). However, we do not dwell here on the technical 
aspects of deriving generalized relations, views, or summary tables. Instead, we 
simply refer collectively to these objects as summaries, and assume that some 
collection of them is available for ranking. 

The remainder of this paper is organized as follows. In the next section, we 
describe heuristics for ranking the interestingness of summaries and provide a 
detailed example for each heuristic. In the third section, we present a summary 
overview of the experimental results. In the last section, we conclude with a 
summary of our work and suggestions for future research. 



2 Interestingness 

We formally define the problem of ranking the interestingness of summaries,
as follows. Let a summary S be a relation defined on the columns
$\{(A_1, D_1), (A_2, D_2), \ldots, (A_n, D_n)\}$, where each $(A_i, D_i)$ is an
attribute-domain pair. Also, let
$\{(A_1, v_{i1}), (A_2, v_{i2}), \ldots, (A_n, v_{in})\}$, $i = 1, 2, \ldots, m$, be a
set of m unique tuples, where each $(A_j, v_{ij})$ is an attribute-value pair and each
$v_{ij}$ is a value from the domain $D_j$ associated with attribute $A_j$. One
attribute $A_n$ is a derived attribute, called Count, whose domain $D_n$ is the set of
positive integers, and whose value $v_{in}$ for each attribute-value pair
$(A_n, v_{in})$ is equal to the number of tuples which have been aggregated from the
base relation (i.e., the unconditioned data present in the original relational
database). The interestingness I of a summary S is given by I = f(S), where f is
typically a function of the cardinality and degree of S, the complexity of its
associated CHs, or the probability distributions of the tuples in S.

A sample summary is shown below in Table 1. In Table 1, there are n = 3
attribute-domain pairs and m = 4 sets of unique attribute-value pairs. A Tuple
ID attribute is being shown for demonstration purposes only and is not actually
part of the summary. Table 1 will be used as the basis for all calculations in the
examples that follow.

Table 1. A sample summary

Tuple ID   Colour   Shape    Count
t_1        red      round    3
t_2        red      square   1
t_3        blue     square   1
t_4        green    round    2



We now describe four heuristics for ranking the interestingness of summaries
and provide a detailed example for each heuristic. The tuples in a summary are
unique, and therefore, can be considered to be a population described by some
probability distribution (derived from the Count attribute). The well-known,
domain-independent formulae, upon which these heuristics are based, have been
selected for evaluation as interestingness measures because they are common
measures of diversity of a population, and have previously seen extensive
application in several areas of the physical, social, management, and computer
sciences. The $I_{var}$ measure is based upon variance, which is the most common
measure of variability used in statistics [11]. The $I_{avg}$ and $I_{tot}$ measures,
based upon a relative entropy measure (also known as the Shannon index) from
information theory [12], measure the average information content in a single tuple
in a summary, and the total information content in a summary, respectively.
The $I_{con}$ measure, a variance-like measure based upon the Simpson index [13],
measures the extent to which the counts are distributed over the tuples in a
summary, rather than being concentrated in any single one of them.

In the discussion that follows, the probability of each $t_i$ occurring in the
actual probability distribution of S is given by:

$$p(t_i) = \frac{v_{in}}{v_{1n} + v_{2n} + \cdots + v_{mn}}$$

and the probability of each $t_i$ occurring in a summary where the tuples have a
uniform probability distribution is given by:

$$q(t_i) = \frac{(v_{1n} + v_{2n} + \cdots + v_{mn}) / m}{v_{1n} + v_{2n} + \cdots + v_{mn}} = \frac{1}{m},$$

where $v_{in}$ is the value associated with the Count attribute $A_n$ in tuple $t_i$.



2.1 The $I_{var}$ Measure

Given a summary S, we can measure how far the actual probability distribution
(hereafter called simply a distribution) of the counts for the $t_i$’s in S varies from
that of a uniform distribution. The variance of the distribution in S from that
of a uniform distribution is given by:

$$I_{var} = \frac{\sum_{i=1}^{m} (p(t_i) - q(t_i))^2}{m - 1},$$

where higher values of $I_{var}$ are considered more interesting. For example, from
the actual distribution of the tuples in Table 1, we have $p(t_1) = 0.429$,
$p(t_2) = 0.143$, $p(t_3) = 0.143$, $p(t_4) = 0.286$, and from a uniform distribution
of the tuples, we have $q(t_i) = 0.25$, for all i. So, the interestingness of the
summary using the $I_{var}$ measure is:

$$I_{var} = ((0.429 - 0.25)^2 + (0.143 - 0.25)^2 + (0.143 - 0.25)^2 + (0.286 - 0.25)^2) / 3 = 0.018.$$

Note that we use m — 1 in the denominator of our calculation for variance 
because we assume the summary may not contain all possible combinations of 
attributes, meaning we are not observing all of the possible tuples. 
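The calculation above can be checked with a short sketch. The function names are hypothetical, and the code assumes a summary is represented only by its list of Count values (at least two of them); it is an illustration of the formula, not the implementation used in the experiments.

```python
def probabilities(counts):
    """Actual distribution p(t_i) derived from the Count attribute."""
    total = sum(counts)
    return [c / total for c in counts]


def i_var(counts):
    """Variance of the actual distribution from the uniform one.

    Uses m - 1 in the denominator, as in the text, so m must be at least 2.
    """
    m = len(counts)
    p = probabilities(counts)
    q = 1.0 / m  # uniform probability q(t_i)
    return sum((p_i - q) ** 2 for p_i in p) / (m - 1)


# Count values of tuples t_1 .. t_4 in Table 1.
counts = [3, 1, 1, 2]
value = i_var(counts)
```

With unrounded probabilities the result is about 0.0187, in line with the value of 0.018 reported above.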
2.2 The $I_{avg}$ Measure

Given a summary S, we can determine the average information content in each
tuple. The average information content, in bits per tuple, is given by:

$$I_{avg} = -\sum_{i=1}^{m} p(t_i) \log_2 p(t_i),$$

where lower values of $I_{avg}$ are considered more interesting. For example, the
actual distribution for p is given in Example 1. So, the interestingness of the
summary using the $I_{avg}$ measure is:

$$I_{avg} = -(0.429 \log_2 0.429 + 0.143 \log_2 0.143 + 0.143 \log_2 0.143 + 0.286 \log_2 0.286) = 1.842 \text{ bits}.$$



2.3 The Itot Measure 

Given a summary S, we can determine its total information content. The total
information content, in bits, is given by:

$$I_{tot} = m \cdot I_{avg},$$

where lower values of $I_{tot}$ are considered more interesting. For example, the
average information content for the summary is given in Example 2. So, the
interestingness of the summary using the $I_{tot}$ measure is:

$$I_{tot} = 4 \cdot 1.842 = 7.368 \text{ bits}.$$
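The two entropy-based measures can be sketched together. As before, the function names are hypothetical and a summary is assumed to be given by its list of Count values, all of which are positive integers as required by the definition of the Count attribute.

```python
import math


def i_avg(counts):
    """Average information content per tuple, in bits (Shannon index)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


def i_tot(counts):
    """Total information content of the summary, in bits: m * I_avg."""
    return len(counts) * i_avg(counts)


# Count values of tuples t_1 .. t_4 in Table 1.
counts = [3, 1, 1, 2]
average_bits = i_avg(counts)
total_bits = i_tot(counts)
```

Computed without intermediate rounding, the total comes out marginally above 7.368, because the text multiplies the already rounded average of 1.842 by 4.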



2.4 The Icon Measure 

Given a summary S, we can measure the extent to which the counts are
distributed over the tuples in the summary. The concentration of the distribution
in S is given by:

$$I_{con} = \sum_{i=1}^{m} p(t_i)^2,$$

where higher values of $I_{con}$ are considered more interesting. For example, the
actual distribution for p is given in Example 1. So, the interestingness of the
summary using the $I_{con}$ measure is:

$$I_{con} = 0.429^2 + 0.143^2 + 0.143^2 + 0.286^2 = 0.306.$$
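A corresponding sketch for the concentration measure follows; the function name is hypothetical and the summary is again assumed to be given by its Count values only.

```python
def i_con(counts):
    """Concentration of the distribution (Simpson index): sum of p(t_i)^2."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


# Count values of tuples t_1 .. t_4 in Table 1.
counts = [3, 1, 1, 2]
concentration = i_con(counts)
```

For Table 1 this evaluates to 15/49, or about 0.306. A uniform distribution over m tuples gives the minimum value 1/m, so for four tuples the measure cannot drop below 0.25.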
3 Experimental Results

In this section, we present a summary overview of the experimental results. All 
summaries in our experiments were generated using DB-Discover [1,2], a soft- 
ware tool which uses AOG for KDD. To generate summaries, series of discovery 
tasks were run on the NSERC Research Awards Database (available in the pub- 
lic domain) and the Customer Database (confidential database supplied by an 
industrial partner). Both have been frequently used in previous data mining 
research [9, 2, 6, 8]. Similar results were obtained from both the NSERC and
Customer discovery tasks; however, we restrict our discussion to those obtained from
the NSERC discovery tasks.

Our experiments show some similarities in how the four interestingness
measures rank summaries [7]. To determine the extent of the ranking similarities, we
can calculate the Gamma correlation coefficient for each pair of interestingness
measures. The Gamma statistic assumes that the summaries under consideration
are assigned ranks according to an ordinal (i.e., rank order) scale, and is
computed as the probability that the rank orderings of two interestingness
measures agree minus the probability that they disagree, divided by 1 minus the
probability of ties. The value of the Gamma statistic varies in the interval
[−1, 1], where values near 1, 0, and −1 represent significant positive, no, and
significant negative correlation, respectively.
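The Gamma statistic can be sketched directly from this description. Dividing by one minus the probability of ties is the same as dividing the difference between agreeing and disagreeing pair counts by their sum, which is the form used below. The function name and the sample ranks are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from itertools import combinations


def gamma(ranks_a, ranks_b):
    """Gamma correlation between two rankings of the same summaries.

    A pair of summaries agrees (is concordant) if both measures order it
    the same way, disagrees (is discordant) if they order it oppositely,
    and is ignored if either measure ties the two summaries.
    """
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(ranks_a, ranks_b), 2):
        product = (a1 - a2) * (b1 - b2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical ranks assigned to four summaries by two measures.
coefficient = gamma([1, 2, 3, 4], [1, 3, 2, 4])
```

In the hypothetical example, five of the six pairs agree and one disagrees, so the coefficient is (5 − 1) / 6, or about 0.667.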

The Gamma correlation coefficients (hereafter called the coefficients) for the
two-, three-, and four-attribute discovery tasks are shown in Table 2. In
Table 2, the Interestingness Measures column describes the pairs of interestingness
measures being compared, and the N-2, N-3, and N-4 columns describe the
coefficients corresponding to the pairs of interestingness measures in the two-,
three-, and four-attribute NSERC discovery tasks, respectively.

Table 2. Comparison of ranking similarities

Interestingness    Gamma Correlation Coefficient
Measures           N-2       N-3       N-4       Average
I_var & I_avg      0.94737   0.95670   0.96076   0.95494
I_var & I_tot      0.92983   0.86428   0.91904   0.90438
I_var & I_con      0.91228   0.93172   0.94929   0.93110
I_avg & I_tot      0.91228   0.86356   0.90947   0.89510
I_avg & I_con      0.94737   0.96506   0.94516   0.95253
I_tot & I_con      0.85965   0.82862   0.86957   0.85261
Average            0.91813   0.90166   0.92555   0.91511



The coefficients vary from a low of 0.82862 for the pair containing the $I_{tot}$
and $I_{con}$ measures in the three-attribute discovery task, to a high of 0.96506
for the pair containing the $I_{avg}$ and $I_{con}$ measures in the same discovery task.
The ranks assigned by the pair containing the $I_{var}$ and $I_{avg}$ measures are most
similar, as indicated by the average coefficient of 0.95494 for the two-, three-,
and four-attribute discovery tasks, followed closely by the ranks assigned by the
pair containing the $I_{avg}$ and $I_{con}$ measures, as indicated by the average
coefficient of 0.95253. The ranks assigned by the pairs of interestingness measures
in the three-attribute discovery task have the least similarity, as indicated by
the average coefficient of 0.90166, although this is not significantly lower than
the two- and four-attribute average coefficients of 0.91813 and 0.92555,
respectively. Given the overall average coefficient is 0.91511, we conclude that the
ranks assigned by the four interestingness measures are highly correlated.



4 Conclusion and Future Research 

We described heuristics for ranking the interestingness of summaries generated
from databases. The heuristics are simple, easy to use, and have the potential
to provide a reasonable starting point for further analysis of discovered
knowledge. Future research will focus on further experimental evaluation of
the measures proposed here, and other measures of diversity [7].

Abstract. One of the most important problems with rule induction methods is
that they cannot extract rules that plausibly represent experts' decision
processes. On one hand, rule induction methods induce probabilistic rules
whose description length is too short compared with the experts' rules. On the
other hand, construction of Bayesian networks generates rules that are too
lengthy. In this paper, the characteristics of experts' rules are closely
examined and a new approach to extracting plausible rules is introduced, which
consists of the following three procedures. First, the characterization of
decision attributes (given classes) is extracted from databases and the
classes are classified into several groups with respect to the
characterization. Then, two kinds of sub-rules, characterization rules for
each group and discrimination rules for each class in the group, are induced.
Finally, those two parts are integrated into one rule for each decision
attribute. The proposed method was evaluated on medical databases, the
experimental results of which show that induced rules correctly represent
experts' decision processes.



1 Introduction 

One of the most important problems in developing expert systems is knowledge 
acquisition from experts [4]. In order to automate this problem, many induc- 
tive learning methods, such as induction of decision trees [3,12], rule induction 
methods [7,9,12,13] and rough set theory [10,15,19], are introduced and applied 
to extract knowledge from databases, and the results show that these methods 
are appropriate. 

However, it has been pointed out that conventional rule induction methods
cannot extract rules that plausibly represent experts' decision processes
[15,16]: the description length of induced rules is too short compared with
the experts' rules. For example, rule induction methods, including AQ15 [9]
and PRIMEROSE [15], induce the following common rule for muscle contraction
headache from databases on differential diagnosis of headache [16]:

[location = whole] ∧ [Jolt Headache = no] ∧ [Tenderness of M1 = yes]
→ muscle contraction headache.

This rule is shorter than the following rule given by medical experts:

[Jolt Headache = no] ∧ [Tenderness of M1 = yes]
∧ [Tenderness of B1 = no] ∧ [Tenderness of C1 = no]
→ muscle contraction headache,

where [Tenderness of B1 = no] and [Tenderness of C1 = no] are added.

These results suggest that conventional rule induction methods do not reflect 
a mechanism of knowledge acquisition of medical experts. 

In this paper, the characteristics of experts' rules are closely examined and
a new approach to extracting plausible rules is introduced, which consists of
the following three procedures. First, the characterization of each decision
attribute (a given class), that is, a list of attribute-value pairs whose
supporting set covers all the samples of the class, is extracted from
databases, and the classes are classified into several groups with respect to
the characterization. Then, two kinds of sub-rules, rules discriminating
between each group and rules classifying each class in the group, are induced.
Finally, those two parts are integrated into one rule for each decision
attribute. The proposed method is evaluated on medical databases, the
experimental results of which show that induced rules correctly represent
experts' decision processes.



2 Rough Set Theory and Probabilistic Rules 

In this section, a probabilistic rule is defined by the use of the following
three notations of rough set theory [10]. The main ideas of these rules are
illustrated by a small database shown in Table 1.

First, a combination of attribute-value pairs, corresponding to a complex in
AQ terminology [9], is denoted by a formula R.

Secondly, a set of samples which satisfy R is denoted by [x]_R, corresponding
to a star in AQ terminology. For example, when {2,3,4,5} is a set of samples
which satisfy [age = 40...49], [x]_[age=40...49] is equal to {2,3,4,5}.¹

Finally, U, which stands for "Universe", denotes all training samples.

Using these notations, we can define several characteristics of classification
in the set-theoretic framework, such as classification accuracy and coverage.
Classification accuracy and coverage (true positive rate) are defined as:

$$\alpha_R(D) = \frac{|[x]_R \cap D|}{|[x]_R|} \; (= P(D|R)), \quad \text{and} \quad \kappa_R(D) = \frac{|[x]_R \cap D|}{|D|} \; (= P(R|D)),$$

where |A| denotes the cardinality of a set A, α_R(D) denotes the
classification accuracy of R as to classification of D, and κ_R(D) denotes the
coverage, or true positive rate, of R to D, respectively.² It is notable that
these two measures are equal to conditional probabilities: accuracy is the
probability of D under the condition of R, and coverage is the probability of
R under the condition of D.

¹ In this notation, "n" denotes the nth sample in a dataset (Table 1).

² Those measures are equivalent to confidence and support defined by Agrawal [1].

Table 1. An Example of Database

No.  age      loc  nat  prod  nau  M1   class
1    50...59  occ  per  no    no   yes  m.c.h.
2    40...49  who  per  no    no   yes  m.c.h.
3    40...49  lat  thr  yes   yes  no   migra
4    40...49  who  thr  yes   yes  no   migra
5    40...49  who  rad  no    no   yes  m.c.h.
6    50...59  who  per  no    yes  yes  psycho

Definitions: loc: location, nat: nature, prod: prodrome, nau: nausea,
M1: tenderness of M1, who: whole, occ: ocular, lat: lateral, per: persistent,
thr: throbbing, rad: radiating, m.c.h.: muscle contraction headache,
migra: migraine, psycho: psychological pain.
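
As a minimal sketch of these two definitions, the following code computes accuracy and coverage for a single attribute-value pair over the six samples of Table 1. The tuple encoding of the table and the function names (`support`, `members`, `accuracy`, `coverage`) are assumptions made for illustration; they are not part of the original system.

```python
# Table 1 encoded as (age, loc, nat, prod, nau, M1, class); sample n is ROWS[n-1].
ATTRS = ("age", "loc", "nat", "prod", "nau", "M1")
ROWS = [
    ("50...59", "occ", "per", "no", "no", "yes", "m.c.h."),
    ("40...49", "who", "per", "no", "no", "yes", "m.c.h."),
    ("40...49", "lat", "thr", "yes", "yes", "no", "migra"),
    ("40...49", "who", "thr", "yes", "yes", "no", "migra"),
    ("40...49", "who", "rad", "no", "no", "yes", "m.c.h."),
    ("50...59", "who", "per", "no", "yes", "yes", "psycho"),
]


def support(attr, value):
    """[x]_R for R = [attr = value], as a set of 1-based sample numbers."""
    col = ATTRS.index(attr)
    return {n for n, row in enumerate(ROWS, start=1) if row[col] == value}


def members(cls):
    """D: the samples that belong to class cls."""
    return {n for n, row in enumerate(ROWS, start=1) if row[-1] == cls}


def accuracy(attr, value, cls):
    """alpha_R(D) = |[x]_R & D| / |[x]_R|."""
    s = support(attr, value)
    return len(s & members(cls)) / len(s)


def coverage(attr, value, cls):
    """kappa_R(D) = |[x]_R & D| / |D|."""
    d = members(cls)
    return len(support(attr, value) & d) / len(d)


print(support("age", "40...49"))            # the set of samples 2, 3, 4, 5
print(accuracy("M1", "yes", "m.c.h."))      # 0.75
print(coverage("M1", "yes", "m.c.h."))      # 1.0
```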

According to the definitions, probabilistic rules with high accuracy and
coverage are defined as:

$$R \xrightarrow{\alpha,\kappa} d \quad \text{s.t.} \quad R = \vee_i R_i = \vee \wedge_j [a_j = v_k], \quad \alpha_{R_i}(D) \geq \delta_\alpha \ \text{and} \ \kappa_{R_i}(D) \geq \delta_\kappa,$$

where δ_α and δ_κ denote given thresholds for accuracy and coverage,
respectively. For the above example shown in Table 1, probabilistic rules for
m.c.h. are given as follows:

[M1 = yes] → m.c.h.   α = 3/4 = 0.75, κ = 1.0,
[nau = no] → m.c.h.   α = 3/3 = 1.0, κ = 1.0,

where δ_α and δ_κ are set to 0.75 and 0.5, respectively.
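
The threshold test can be sketched as follows for rules with a single attribute-value pair. This is only an illustration under the assumption that Table 1 is encoded as below; the function name `probabilistic_rules` is hypothetical. On the table as transcribed here, the pair [prod = no] also passes both thresholds, in addition to the two rules quoted above.

```python
ATTRS = ("age", "loc", "nat", "prod", "nau", "M1")
ROWS = [  # Table 1; the last field is the class
    ("50...59", "occ", "per", "no", "no", "yes", "m.c.h."),
    ("40...49", "who", "per", "no", "no", "yes", "m.c.h."),
    ("40...49", "lat", "thr", "yes", "yes", "no", "migra"),
    ("40...49", "who", "thr", "yes", "yes", "no", "migra"),
    ("40...49", "who", "rad", "no", "no", "yes", "m.c.h."),
    ("50...59", "who", "per", "no", "yes", "yes", "psycho"),
]


def probabilistic_rules(cls, delta_alpha, delta_kappa):
    """Single-pair rules [attr = value] -> cls that meet both thresholds."""
    d = {n for n, row in enumerate(ROWS, 1) if row[-1] == cls}
    rules = {}
    for col, attr in enumerate(ATTRS):
        for value in sorted({row[col] for row in ROWS}):
            s = {n for n, row in enumerate(ROWS, 1) if row[col] == value}
            alpha = len(s & d) / len(s)   # accuracy
            kappa = len(s & d) / len(d)   # coverage
            if alpha >= delta_alpha and kappa >= delta_kappa:
                rules[(attr, value)] = (alpha, kappa)
    return rules


for (attr, value), (alpha, kappa) in probabilistic_rules("m.c.h.", 0.75, 0.5).items():
    print(f"[{attr} = {value}] -> m.c.h.  alpha = {alpha}, kappa = {kappa}")
```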



Rough Inclusion 

In order to measure the similarity between classes with respect to
characterization, we introduce a rough inclusion measure μ, which is defined
as follows:

$$\mu(S, T) = \frac{|S \cap T|}{|S|}.$$

It is notable that if S ⊆ T, then μ(S, T) = 1.0, which shows that this
relation extends subset and superset relations. This measure was introduced by
Polkowski and Skowron in their study on rough mereology [11]. Whereas rough
mereology was first applied to distributed information systems, its essential
idea is rough inclusion: rough inclusion focuses on set inclusion to
characterize a hierarchical structure based on the relation between a subset
and a superset. Thus, applying rough inclusion to capture the relations
between classes is equivalent to constructing a rough hierarchical structure
between classes, which is also closely related to information granulation
proposed by Zadeh [18].
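
A minimal sketch of the rough inclusion measure follows, assuming the sets are given as Python sets; the function name is hypothetical and the sample sets are made up for illustration.

```python
def rough_inclusion(s, t):
    """mu(S, T) = |S & T| / |S|: the degree to which S is included in T."""
    s, t = set(s), set(t)
    if not s:
        raise ValueError("mu(S, T) is undefined for an empty S")
    return len(s & t) / len(s)


print(rough_inclusion({"a", "b"}, {"a", "b", "c"}))   # 1.0, since S is a subset of T
print(rough_inclusion({"a", "b"}, {"b", "c"}))        # 0.5, since half of S lies in T
```

Note that the measure is not symmetric: swapping the arguments of the first call gives 2/3.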



3 Interpretation of Medical Experts’ Rules 

As shown in Section 1, rules acquired from medical experts are much longer
than those induced from databases whose decision attributes are given by the
same experts.³

These characteristics of medical experts' rules become clear not by comparing
rules for the same class, but by comparing experts' rules for one class with
those for another class. For example, a classification rule for muscle
contraction headache is given by:

[Jolt Headache = no]
∧ ([Tenderness of M0 = yes] ∨ [Tenderness of M1 = yes] ∨ [Tenderness of M2 = yes])
∧ [Tenderness of B1 = no] ∧ [Tenderness of B2 = no] ∧ [Tenderness of B3 = no]
∧ [Tenderness of C1 = no] ∧ [Tenderness of C2 = no] ∧ [Tenderness of C3 = no]
∧ [Tenderness of C4 = no]
→ muscle contraction headache

This rule is very similar to the following classification rule for disease of
cervical spine:

[Jolt Headache = no]
∧ ([Tenderness of M0 = yes] ∨ [Tenderness of M1 = yes] ∨ [Tenderness of M2 = yes])
∧ ([Tenderness of B1 = yes] ∨ [Tenderness of B2 = yes] ∨ [Tenderness of B3 = yes]
∨ [Tenderness of C1 = yes] ∨ [Tenderness of C2 = yes] ∨ [Tenderness of C3 = yes]
∨ [Tenderness of C4 = yes])
→ disease of cervical spine

The two rules differ only in the attribute-value pairs from tenderness of B1
to C4. Thus, these two rules can be simplified into the following form:

a_1 ∧ A_2 ∧ ¬A_3 → muscle contraction headache

a_1 ∧ A_2 ∧ A_3 → disease of cervical spine

The first two terms and the third one represent different reasoning. The first
and second terms, a_1 and A_2, are used to differentiate muscle contraction
headache and disease of cervical spine from other diseases. The third term,
A_3, is used to make a differential diagnosis between these two diseases.
Thus, medical experts first select several diagnostic candidates, which are
very similar to each other, from many diseases and then make a final diagnosis
from those candidates.

In the next section, a new approach for inducing the above rules is introduced.

³ This is because rule induction methods generally search for shorter rules,
compared with decision tree induction. In the latter case, the induced trees
are sometimes too deep, and pruning and examination by experts are required
for the trees to be useful. One of the main reasons why rules are short and
decision trees are sometimes long is that these patterns are generated by only
one criterion, such as high accuracy or high information gain. The comparative
study in this section suggests that experts acquire rules by using several
measures rather than a single criterion.

In the next section, a new approach for inducing the above rules is introduced. 

4 Rule Induction 

Rule induction (Fig. 1) consists of the following three procedures. First, the
characterization of each given class, that is, a list of attribute-value pairs
whose supporting set covers all the samples of the class, is extracted from
databases, and the classes are classified into several groups with respect to
the characterization. Then, two kinds of sub-rules, rules discriminating
between each group and rules classifying each class in the group, are induced
(Fig. 2). Finally, those two parts are integrated into one rule for each
decision attribute (Fig. 3).⁴



Example 



Let us illustrate how the introduced algorithm works by using the small
database in Table 1. For simplicity, the two thresholds δ_α and δ_μ are set to
1.0, which means that only deterministic rules should be induced and that only
subset and superset relations should be considered for grouping classes.

After the first and second steps, the following three sets will be obtained:
L(m.c.h.) = {[prod = no], [M1 = yes]}, L(migra) = {[age = 40...49],
[nat = who], [prod = yes], [nau = yes], [M1 = no]}, and L(psycho) =
{[age = 50...59], [loc = who], [nat = per], [prod = no], [nau = no],
[M1 = yes]}. Thus, since the relation L(psycho) ⊃ L(m.c.h.) holds (i.e.,
μ(L(m.c.h.), L(psycho)) = 1.0), the new decision attributes are D_1 =
{m.c.h., psycho} and D_2 = {migra}, and a partition P = {D_1, D_2} is
obtained. From this partition, two decision tables will be generated, as shown
in Table 2 and Table 3 in the fifth step.
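
The grouping step of this example can be sketched as follows. The code takes the three characterization sets exactly as they are listed in the text above and merges classes whose rough inclusion reaches the threshold. Treating inclusion in either direction as sufficient, as well as the names `group_classes` and `delta_mu`, are simplifying assumptions made for this illustration and are not a transcription of Fig. 1.

```python
def rough_inclusion(s, t):
    """mu(S, T) = |S & T| / |S|."""
    return len(s & t) / len(s)


def group_classes(char_sets, delta_mu):
    """Greedily put a class into the first group containing a similar class."""
    groups = []
    for cls, pairs in char_sets.items():
        for group in groups:
            if any(
                max(rough_inclusion(pairs, char_sets[other]),
                    rough_inclusion(char_sets[other], pairs)) >= delta_mu
                for other in group
            ):
                group.append(cls)
                break
        else:  # no existing group is similar enough: start a new one
            groups.append([cls])
    return groups


# Characterization sets L(D) as listed in the example text.
L = {
    "m.c.h.": {("prod", "no"), ("M1", "yes")},
    "migra": {("age", "40...49"), ("nat", "who"), ("prod", "yes"),
              ("nau", "yes"), ("M1", "no")},
    "psycho": {("age", "50...59"), ("loc", "who"), ("nat", "per"),
               ("prod", "no"), ("nau", "no"), ("M1", "yes")},
}

print(group_classes(L, delta_mu=1.0))
```

With the threshold at 1.0, m.c.h. and psycho fall into one group and migra forms a group of its own, matching the partition P described above.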

In the sixth step, classification rules for D_1 and D_2 are induced from
Table 2. For example, the following rules are obtained for D_1:

[M1 = yes]       → D_1   α = 1.0, κ = 1.0, supported by {1,2,5,6}
[prod = no]      → D_1   α = 1.0, κ = 1.0, supported by {1,2,5,6}
[nau = no]       → D_1   α = 1.0, κ = 0.75, supported by {1,2,5}
[nat = per]      → D_1   α = 1.0, κ = 0.75, supported by {1,2,6}
[loc = who]      → D_1   α = 1.0, κ = 0.75, supported by {2,5,6}
[age = 50...59]  → D_1   α = 1.0, κ = 0.5, supported by {2,6}



In the seventh step, classification rules for m.c.h. and psycho are induced
from Table 3. For example, the following rules are obtained for m.c.h.

⁴ This method is an extension of PRIMEROSE4 reported in [17]. In the former
paper, only rigid set-inclusion relations are considered for grouping; on the
other hand, rough-inclusion relations are introduced in this approach. A
recent empirical comparison between the set-inclusion method and the
rough-inclusion method shows that the latter approach outperforms the former
one.

procedure Rule Induction (Total Process);
var
  i : integer; M, L, R : List;
  L_D : List; /* A list of all classes */
begin
  Calculate alpha_R(D_i) and kappa_R(D_i) for each elementary relation R and each class D_i;
  Make a list L(D_i) = {R | kappa_R(D_i) = 1.0} for each class D_i;
  while (L_D ≠ ∅) do
  begin
    D_i := first(L_D); M := L_D - D_i;
    while (M ≠ ∅) do
    begin
      D_j := first(M);
      if (mu(L(D_j), L(D_i)) ≥ delta_mu) then L_2(D_i) := L_2(D_i) + {D_j};
      M := M - D_j;
    end
    Make a new decision attribute D'_i for L_2(D_i);
    L_D := L_D - D_i;
  end
  Construct a new table (T_2(D_i)) for L_2(D_i).
  Construct a new table (T(D'_i)) for each decision attribute D'_i;
  Induce classification rules R_2 for each L_2(D); /* Fig. 2 */
  Store Rules into a List R(D)
  Induce classification rules R_d for each D' in T(D'); /* Fig. 2 */
  Store Rules into a List R(D') (= R(L_2(D_i)))
  Integrate R_2 and R_d into a rule R_D; /* Fig. 3 */
end {Rule Induction};

Fig. 1. An Algorithm for Rule Induction



[nau = no]       → m.c.h.   α = 1.0, κ = 1.0, supported by {1,2,5}

[age = 40...49]  → m.c.h.   α = 1.0, κ = 0.67, supported by {2,5}

In the eighth step, these two kinds of rules are integrated in the following
way. The rules [M1 = yes] → D_1, [nau = no] → m.c.h., and [age = 40...49] →
m.c.h. have a supporting set which is a subset of {1,2,5,6}. Thus, the
following rules are obtained:

[M1 = yes] & [nau = no]       → m.c.h.   α = 1.0, κ = 1.0, supported by {1,2,5}
[M1 = yes] & [age = 40...49]  → m.c.h.   α = 1.0, κ = 0.67, supported by {2,5}
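
The integration step can be sketched as below. The representation of a rule as a tuple of conditions plus a supporting set, and the function name `integrate`, are assumptions made for illustration; the sketch only reproduces the subset check on supporting sets described in the eighth step.

```python
def integrate(group_rule, class_rules):
    """Conjoin a group-level rule with each class-level rule it subsumes.

    group_rule:  (conditions, supporting_set)
    class_rules: list of (conditions, target_class, supporting_set)
    """
    group_cond, group_support = group_rule
    merged = []
    for cond, target, support in class_rules:
        if support <= group_support:          # supporting set is a subset
            merged.append((group_cond + cond, target, support))
    return merged


# Rules taken from the worked example.
group_rule = ((("M1", "yes"),), {1, 2, 5, 6})
class_rules = [
    ((("nau", "no"),), "m.c.h.", {1, 2, 5}),
    ((("age", "40...49"),), "m.c.h.", {2, 5}),
]

for cond, target, support in integrate(group_rule, class_rules):
    text = " & ".join(f"[{a} = {v}]" for a, v in cond)
    print(f"{text} -> {target}, supported by {sorted(support)}")
```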

5 Experimental Results 

The above rule induction algorithm is implemented in PRIMEROSE4.5
(Probabilistic Rule Induction Method based on Rough Sets Ver 4.5),⁵ and was
applied

⁵ The program is implemented by using SWI-Prolog [14] on a Sparc Station 20.

procedure Induction of Classification Rules;
var
  i : integer; M, L_i : List;
begin
  L_1 := L_er; /* L_er: List of Elementary Relations */
  i := 1; M := {};
  for i := 1 to n do /* n: Total number of attributes */
  begin
    while (L_i ≠ {}) do
    begin
      Select one pair R = ∧[a_i = v_j] from L_i;
      L_i := L_i - {R};
      if (alpha_R(D) ≥ delta_alpha) and (kappa_R(D) ≥ delta_kappa)
        then do S_ir := S_ir + {R}; /* Include R as Inclusive Rule */
        else M := M + {R};
    end
    L_{i+1} := (A list of the whole combination of the conjunction formulae in M);
  end
end {Induction of Classification Rules};

Fig. 2. An Algorithm for Classification Rules
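
The level-wise search of Fig. 2 can be sketched in Python as follows. This is an illustration rather than the PRIMEROSE4.5 implementation: it assumes that M is rebuilt at every level, that a candidate at level i+1 is the union of two rejected formulae covering i+1 distinct attributes, and that Table 1 is encoded as shown. All names are hypothetical.

```python
from itertools import combinations

ATTRS = ("age", "loc", "nat", "prod", "nau", "M1")
ROWS = [  # Table 1; the last field is the class
    ("50...59", "occ", "per", "no", "no", "yes", "m.c.h."),
    ("40...49", "who", "per", "no", "no", "yes", "m.c.h."),
    ("40...49", "lat", "thr", "yes", "yes", "no", "migra"),
    ("40...49", "who", "thr", "yes", "yes", "no", "migra"),
    ("40...49", "who", "rad", "no", "no", "yes", "m.c.h."),
    ("50...59", "who", "per", "no", "yes", "yes", "psycho"),
]


def stats(formula, target):
    """Accuracy and coverage of a conjunction of (attribute, value) pairs."""
    support = {n for n, row in enumerate(ROWS, 1)
               if all(row[ATTRS.index(a)] == v for a, v in formula)}
    positives = {n for n, row in enumerate(ROWS, 1) if row[-1] == target}
    hit = support & positives
    alpha = len(hit) / len(support) if support else 0.0
    return alpha, len(hit) / len(positives)


def induce_rules(target, delta_alpha, delta_kappa):
    # Level 1: the elementary relations (single attribute-value pairs).
    level = {frozenset([(a, row[i])]) for row in ROWS for i, a in enumerate(ATTRS)}
    rules = []
    for size in range(1, len(ATTRS) + 1):
        rest = []                                  # M: formulae that failed the test
        for formula in sorted(level, key=sorted):
            alpha, kappa = stats(formula, target)
            if alpha >= delta_alpha and kappa >= delta_kappa:
                rules.append((formula, alpha, kappa))
            else:
                rest.append(formula)
        level = set()                              # next level: longer conjunctions
        for f, g in combinations(rest, 2):
            joined = f | g
            if len(joined) == size + 1 and len({a for a, _ in joined}) == size + 1:
                level.add(joined)
    return rules


for formula, alpha, kappa in induce_rules("m.c.h.", 1.0, 1.0):
    print(sorted(formula), alpha, kappa)
```

Because accepted formulae are removed before conjunctions are formed, the search never extends a rule that already satisfies both thresholds.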



Table 2. A Table for a New Partition P

No.  age      loc  nat  prod  nau  M1  class
1    50...59  occ  per  0     0    1   D_1
2    40...49  who  per  0     0    1   D_1
3    40...49  lat  thr  1     1    0   D_2
4    40...49  who  thr  1     1    0   D_2
5    40...49  who  rad  0     0    1   D_1
6    50...59  who  per  0     1    1   D_1

to databases on differential diagnosis of headache, whose training samples
consist of 1477 samples, 20 classes and 20 attributes.

This system was compared with PRIMEROSE4 [17], PRIMEROSE [15], C4.5 [12],
CN2 [5], AQ15 [9] and k-NN [2]⁶ with respect to the following points: length
of rules, similarity between induced rules and experts' rules, and performance
of rules.

In this experiment, length was measured by the number of attribute-value pairs
used in an induced rule, and Jaccard's coefficient was adopted as a similarity
measure [6]. Concerning the performance of rules, ten-fold cross-validation
was applied to estimate classification accuracy.
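
As a small illustration of the similarity measure, the sketch below computes Jaccard's coefficient between two rules viewed as sets of attribute-value pairs, using the induced rule and the expert rule for muscle contraction headache quoted in the Introduction. How the experiment aggregated similarity over all rules is not described here, so the set-of-pairs representation and the function name are assumptions.

```python
def jaccard(rule_a, rule_b):
    """Jaccard's coefficient |A & B| / |A | B| for two sets of conditions."""
    a, b = set(rule_a), set(rule_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty rules
    return len(a & b) / len(a | b)


induced = {("location", "whole"), ("Jolt Headache", "no"),
           ("Tenderness of M1", "yes")}
expert = {("Jolt Headache", "no"), ("Tenderness of M1", "yes"),
          ("Tenderness of B1", "no"), ("Tenderness of C1", "no")}

print(jaccard(induced, expert))   # 2 shared pairs out of 5 distinct pairs: 0.4
```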



⁶ The optimal k for each domain is given with Table 4.

procedure Rule Integration;
var
  i : integer; M, L_2 : List; R(D_i) : List; /* A list of rules for D_i */
  L_D : List; /* A list of all classes */
begin
  while (L_D ≠ ∅) do
  begin
    D_i := first(L_D); M := L_2(D_i);
    Select one rule R' → D'_i from R(L_2(D_i)).
    while (M ≠ ∅) do
    begin
      D_j := first(M);
      Select one rule R → d_j for D_j;
      Integrate two rules: R ∧ R' → d_j.
      M := M - {D_j};
    end
    L_D := L_D - D_i;
  end
end {Rule Combination}

Fig. 3. An Algorithm for Rule Integration

Table 3. A Table for D_1

No.  age      loc  nat  prod  nau  M1  class
1    50...59  occ  per  0     0    1   m.c.h.
2    40...49  who  per  0     0    1   m.c.h.
5    40...49  who  rad  0     0    1   m.c.h.
6    50...59  who  per  0     1    1   psycho



Table 4 shows the experimental results, which suggest that PRIMEROSE4.5
outperforms PRIMEROSE4 (the set-inclusion approach) and the other four rule
induction methods, and induces rules very similar to medical experts' ones.

Table 4. Experimental Results

Method          Length        Similarity    Accuracy
Headache
PRIMEROSE4.5    8.8 ± 0.27    0.95 ± 0.08   95.2 ± 2.7%
PRIMEROSE4.0    8.6 ± 0.27    0.93 ± 0.08   93.3 ± 2.7%
Experts         9.1 ± 0.33    1.00 ± 0.00   98.0 ± 1.9%
PRIMEROSE       5.3 ± 0.35    0.54 ± 0.05   88.3 ± 3.6%
C4.5            4.9 ± 0.39    0.53 ± 0.10   85.8 ± 1.9%
CN2             4.8 ± 0.34    0.51 ± 0.08   87.0 ± 3.1%
AQ15            4.7 ± 0.35    0.51 ± 0.09   86.2 ± 2.9%
k-NN (7)        6.7 ± 0.25    0.61 ± 0.09   88.2 ± 1.5%

k-NN (i) shows the value of i which gives the highest performance in
k (1 ≤ k ≤ 20).

6 Discussion: Granular Fuzzy Partition

6.1 Granular Fuzzy Partition

Coverage is also closely related to granular fuzzy partition, which was
introduced by Lin [8] in the context of granular computing.

Since coverage κ_R(D) is equivalent to a conditional probability, P(R|D), this
measure will satisfy the condition on partition of unity, called BH-partition.
(If we select a suitable partition of the universe, then this partition will
satisfy $\sum_R \kappa_R(D) = 1.0$.) Also, from the definition of coverage, it
is equivalent to the counting measure for |[x]_R ∩ D|, since |D| is constant
in a given universe U. Thus, this measure satisfies a "nice context", which
holds:

$$|[x]_{R_1} \cap D| + |[x]_{R_2} \cap D| \leq |D|.$$

Hence, all these features show that a partition generated by coverage is a
kind of granular fuzzy partition [8]. This result also shows that the
characterization by coverage is closely related to information granulation.

From this point of view, the usage of coverage for characterization and
grouping of classes means that we focus on some specific partition generated
by attribute-value pairs whose coverage is equal to 1.0, and that we consider
the second-order relations between these pairs. It is also notable that if the
second-order relation makes a partition, as shown in the example above, then
this structure can also be viewed as a granular fuzzy partition.

However, rough inclusion and accuracy do not always satisfy the nice context.
It will be our future work to examine the formal characteristics of coverage
(and also accuracy) and rough inclusion from the viewpoint of granular fuzzy
sets.

7 Conclusion 

In this paper, the characteristics of experts' rules are closely examined, and
the empirical results suggest that grouping of diseases is very important for
realizing automated acquisition of medical knowledge from clinical databases.
Thus, we focus on the role of coverage in focusing mechanisms and propose an
algorithm for grouping diseases by using this measure. The above experiments
show that rule induction with this grouping generates rules which are similar
to medical experts' rules, and they suggest that our proposed method appears
to capture medical experts' reasoning. Interestingly, the idea of the proposed
procedure is very similar to rough mereology. The proposed method was
evaluated on three medical databases, the experimental results of which show
that induced rules correctly represent experts' decision processes and also
suggest that rough mereology may be useful for capturing medical experts'
decision processes.

Abstract: In the classic rough set theory [1], two important concepts (reduct
of information system and relative reduct of decision table) are defined. They
play important roles in KDD systems based on rough sets, and can be used to
remove irrelevant or redundant attributes from a practical database to improve
the efficiency of rule extraction and the performance of the rules mined. Many
researchers have provided reduct-computing algorithms, but most of them are
designed for static databases; hence they do not have incremental learning
capability. This paper first proposes the idea of a discernibility system and
gives its formal definition, then presents the concept of a reduct in a
discernibility system, which can be viewed as a generalization of the relative
reduct of a decision table and the reduct of an information system. Finally,
based on the concept of discernibility system, an incremental algorithm,
ASRAI, for computing the relative reduct of a decision table is presented.

Keywords: Rough sets, information system, decision table, reduct, 
relative reduct, discernibility system, incremental learning. 

1 Introduction 

With the rapid development of computer techniques, the data-collecting ability
of mankind has been increasing rapidly and various kinds of databases have
been becoming larger and larger. By contrast, our ability to deal with data
has fallen behind; a lot of unhandled data is put away as rubbish in various
fields of society. How can useful models be extracted from large amounts of
business data for making correct decisions? How can scientific data be handled
to discover underlying scientific laws? How can a large number of clinical
cases be mined for constructing medical expert systems? All of these questions
have become the potential impetus for KDD research.

A practical database is often huge, and may contain many redundant,
irrelevant, or miscellaneous attributes. The existence of such attributes will
greatly decrease the efficiency of the rule-extraction process and the
performance of the rules mined. Performing attribute reduction to remove such
redundant or irrelevant attributes has become an important phase in the KDD
process [2].

In the early 1980s, Professor Z. Pawlak presented the theory of Rough Sets,
which can be used as a formal framework for attribute reduction. In rough set
theory, knowledge is considered as classification ability, and the essence of
attribute reduction is thought of as preserving the invariance of some kind of
discerning ability through a minimal set of attributes. We think that
discerning should refer to two objects, and that the discernible relation
should be symmetric, so we can use the concept of a non-ordered pair as the
basic element of the discernible relation. Based on it, we present the concept
of a discernibility system as a formal framework for reducing the set of
attributes. This can be regarded as an extension of classic Rough Sets.

This paper is organized as follows: 

Section 2 gives some basic concepts of classic Rough Sets. In section 3, based
on the concept of the non-ordered pair, we give the definitions of a
discernibility system and its reduct. In section 4, the problem of computing
the reduct set in an information system or the relative reduct set in a
decision table is transformed into the problem of computing the reduct set of
a specific discernibility system. An incremental algorithm is developed in
section 5. At the end of this paper, an example is used to illustrate the idea
of this algorithm.

2 Basic Concepts in Classic Rough Sets

In the theory of the classic Rough Sets [1], we have the following definitions: 

Definition 2.1. An information system is a pair 𝒜 = (U, A), where U is a
non-empty, finite set called the universe and A is a non-empty, finite set of
attributes, i.e. a: U → V_a for a ∈ A, where V_a is called the value set of
attribute a. The elements of U are called objects.

Every information system 𝒜 = (U, A) and a non-empty set B ⊆ A define a
B-information function by Inf_B(x) = {(a, a(x)): a ∈ B} for x ∈ U. The set
{Inf_A(x): x ∈ U} is called the A-information set and it is denoted by INF(𝒜).

Definition 2.2. Let 𝒜 = (U, A) be an information system. With every subset of
attributes B ⊆ A, an equivalence relation, denoted by IND(B) and called the
B-indiscernible relation, is associated and defined by IND(B) = {(s, s'):
s, s' ∈ U, b(s) = b(s') for every b ∈ B}.

In this paper, the indiscernible relation is also thought of as a set of
non-ordered pairs on U. (For the definition of a non-ordered pair, refer to
Definition 3.1.)

For (x, x') ∈ IND(B), object x and object x' cannot be discerned by the
attributes in B. A minimal subset B of A that satisfies IND(A) = IND(B) is
called a reduct of 𝒜. That is, if B ⊆ A is a reduct of 𝒜, then IND(A) = IND(B)
and ∀C ⊂ B (IND(C) ⊃ IND(A)). The family of all reducts of 𝒜 is called the
reduct set of 𝒜. In this paper, (x, x') ∈ IND(B) is also written as
x IND(B) x'.
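
The indiscernibility relation can be illustrated by computing its equivalence classes. The sketch below uses a made-up three-object table; the table, the attribute names, and the function name `ind_classes` are hypothetical and serve only as an example.

```python
def ind_classes(table, attrs):
    """Equivalence classes of IND(B): objects with equal values on all of attrs."""
    classes = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, set()).add(obj)
    return sorted(classes.values(), key=min)


# A hypothetical information system with U = {x1, x2, x3} and A = {a, b, c}.
TABLE = {
    "x1": {"a": 0, "b": 0, "c": 1},
    "x2": {"a": 0, "b": 1, "c": 1},
    "x3": {"a": 1, "b": 1, "c": 0},
}

print(ind_classes(TABLE, ["a"]))            # x1 and x2 fall into one class
print(ind_classes(TABLE, ["a", "b"]))       # every object is discerned
print(ind_classes(TABLE, ["a", "b"]) == ind_classes(TABLE, ["a", "b", "c"]))
```

In this toy table, {a, b} induces the same partition as the full attribute set, while {a} alone does not, so {a, b} is a reduct candidate that cannot be shrunk to {a}.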




222 Zongtian Liu and Zhipeng Xie 



Definition 2.3. If 𝒜 = (U, A) is an information system, B ⊆ A is a set of
attributes and X ⊆ U is a set of objects, then the sets
B_*X = {s ∈ U: [s]_B ⊆ X} and B^*X = {s ∈ U: [s]_B ∩ X ≠ ∅}
are called the B-lower and B-upper approximation of X in 𝒜, respectively.
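
A minimal sketch of the two approximations follows, again on a made-up three-object table; the table and the function name `approximations` are hypothetical.

```python
def approximations(table, attrs, target):
    """Return the B-lower and B-upper approximation of target as two sets."""
    blocks = {}
    for obj, row in table.items():          # equivalence classes [s]_B
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:                 # [s]_B is contained in X
            lower |= block
        if block & target:                  # [s]_B intersects X
            upper |= block
    return lower, upper


TABLE = {
    "x1": {"a": 0, "b": 0},
    "x2": {"a": 0, "b": 1},
    "x3": {"a": 1, "b": 1},
}

# With B = {a}, x1 and x2 are indiscernible, so X = {x1, x3} is rough.
print(approximations(TABLE, ["a"], {"x1", "x3"}))
```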

Definition 2.4. A decision table is an information system of the form
𝒜 = (U, A∪{d}), where d ∉ A is a distinguished attribute called the decision.
The elements of A are called conditions.

For simplicity of notation we assume that the set V_d of values of the
decision d is equal to {1, 2, ..., r(d)}. We can observe that the decision d
determines the partition {X_1, X_2, ..., X_r(d)} of the universe U, where
X_k = {x ∈ U: d(x) = k} for 1 ≤ k ≤ r(d).

Definition 2.5. Given a decision table 𝒜 = (U, A∪{d}) and B ⊆ A, if
V_d = {1, 2, ..., r(d)} and X_1, X_2, ..., X_r(d) are the decision classes of
𝒜, then the set B_*X_1 ∪ B_*X_2 ∪ ... ∪ B_*X_r(d) is called the B-positive
region of 𝒜, denoted by POS(B, {d}).

Definition 2.6. Given a decision table 𝒜 = (U, A∪{d}) and B ⊆ A, if
POS(B, {d}) = POS(A, {d}) and ∀C ⊂ B (POS(C, {d}) ⊂ POS(A, {d})), then B is
called a relative reduct of 𝒜. The family of all relative reducts of 𝒜 is
called the relative reduct set of 𝒜.

Definition 2.7. If 𝒜 = (U, A∪{d}) is a decision table, then we define a
function ∂_A: U → ρ({1, 2, ..., r(d)}), called the generalized decision in 𝒜,
by ∂_A(x) = {i | ∃x' ∈ U: x' IND(A) x and d(x') = i}.

Definition 2.8. A decision table 𝒜 is called consistent if |∂_A(x)| = 1 for
any x ∈ U; otherwise 𝒜 is inconsistent.

NOTE: Every decision table that appears in this paper is assumed to be
consistent.

3 Discernibility System 

Definition 3.1. Given a set X of objects, if x ∈ X and y ∈ X, then the
two-element set {x, y} is called a non-ordered pair (NOP) on X, which can be
denoted by (x, y). It is apparent that (x, y) = (y, x). We use ANO(X) to
denote the set of all the non-ordered pairs on X, that is,
ANO(X) = {(x, y) | x ∈ X, y ∈ X}.

Definition 3.2. Given a non-empty, finite set X of objects and a non-empty,
finite set A of attributes, we call (x, y) an A-discernible NOP on X if
x, y ∈ X and ∃a ∈ A (a(x) ≠ a(y)). The set of all the A-discernible NOPs on X
is denoted by ADNO(X, A), that is,
ADNO(X, A) = {(x, y) | x, y ∈ X, ∃a ∈ A (a(x) ≠ a(y))}.

Definition 3.3. A triple 𝒟 = (X, A, D) is called a discernibility system,
where X is a non-empty, finite set of objects, A is a non-empty, finite set of
attributes and D is a subset of ADNO(X, A).

Definition 3.4. Given a discernibility system 𝒟 = (X, A, D), if B ⊆ A
satisfies

$$\forall (x, y) \in D \; (\exists b \in B \; (b(x) \neq b(y))) \qquad (1)$$

then we say that B is a distinction of 𝒟.

Property 3.1. Given a discernibility system 𝒟 = (X, A, D), if B ⊆ A is a
distinction of 𝒟 and B ⊆ C ⊆ A, then C is a distinction of 𝒟.

Proof: omitted.

Property 3.2. Given two discernibility systems 𝒟1 = (X, A, D1) and
𝒟2 = (X, A, D2), if D1 ⊆ D2 and B ⊆ A is a distinction of 𝒟2, then B is also a
distinction of 𝒟1.

Proof: omitted.



Definition 3.5. Given a discernibility system 𝒟 = (X, A, D), if B is a
distinction of 𝒟 and any C ⊂ B is not a distinction of 𝒟, then B is called a
reduct of 𝒟. The family of all the reducts of 𝒟 is called the reduct set of 𝒟
and it is denoted by RED(𝒟).

Given a discernibility system 𝒟 = (X, A, D), if B ⊆ A is a distinction of 𝒟,
then we can get a reduct of 𝒟 through the following algorithm:

for b ∈ B do
begin
  if B-{b} is a distinction of 𝒟 then B := B-{b};
end;
B' := B;

It is apparent that B' is a minimal distinction of 𝒟, so B' is a reduct of 𝒟
according to Definition 3.5. We have the following theorem:

Theorem 3.1. Given a discernibility system 𝒟 = (X, A, D), if B ⊆ A is a
distinction of 𝒟 and 𝒮 is the reduct set of 𝒟, then there must exist a reduct
S ∈ 𝒮 that satisfies S ⊆ B.
Proof: omitted.
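
The reduction procedure given before Theorem 3.1 can be sketched in Python as follows. The toy table, the pair set D, and the function names are hypothetical; the sketch assumes attribute values are looked up in a dictionary per object and tries to drop attributes in the order given.

```python
def is_distinction(pairs, attrs, table):
    """Condition (1): every pair in D is discerned by some attribute in attrs."""
    return all(any(table[x][a] != table[y][a] for a in attrs) for x, y in pairs)


def reduce_distinction(pairs, attrs, table):
    """Drop attributes one at a time while the rest is still a distinction."""
    current = list(attrs)
    for b in list(attrs):
        trial = [a for a in current if a != b]
        if is_distinction(pairs, trial, table):
            current = trial
    return set(current)


TABLE = {
    "x1": {"a": 0, "b": 0, "c": 1},
    "x2": {"a": 0, "b": 1, "c": 1},
    "x3": {"a": 1, "b": 1, "c": 0},
}
D = [("x1", "x2"), ("x1", "x3"), ("x2", "x3")]

print(reduce_distinction(D, ["a", "b", "c"], TABLE))
```

The result depends on the order in which attributes are tried: starting from a different order may yield a different reduct of the same system.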



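Definition 3.6. Given a family ℬ = {B1, B2, ..., Bn}, we can define
𝒞 = MIN_RED(ℬ) as the family of minimal elements of ℬ:

$$\mathrm{MIN\_RED}(\mathcal{B}) = \{B \in \mathcal{B} \mid \forall B_i \in \mathcal{B} \; (B_i \not\subset B)\}. \qquad (2)$$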
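
Definition 3.6 can be sketched directly with Python sets, where `<` on frozensets tests for a proper subset. The function name and the sample family are hypothetical.

```python
def min_red(family):
    """Keep the members of the family that have no proper subset in the family."""
    family = {frozenset(b) for b in family}
    return {b for b in family if not any(other < b for other in family)}


print(min_red([{"a"}, {"a", "b"}, {"b", "c"}]))   # {a, b} is dropped because {a} is in the family
```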



Theorem 3.2. Given two discernibility systems 𝒟1 = (X, A, D1) and
𝒟2 = (X, A, D2), if 𝒮1 and 𝒮2 are the reduct sets of 𝒟1 and 𝒟2 respectively,
and 𝒮' = {S1 ∪ S2 | S1 ∈ 𝒮1, S2 ∈ 𝒮2}, then 𝒮 = MIN_RED(𝒮') is the reduct set
of the discernibility system 𝒟 = (X, A, D1 ∪ D2).

Proof:

1. First we prove that any S ∈ 𝒮 is a reduct of 𝒟.

For any S ∈ 𝒮 and (x', y') ∈ D1 ∪ D2, we have S ∈ 𝒮', so there exist S1 ∈ 𝒮1
and S2 ∈ 𝒮2 that satisfy S1 ⊆ S and S2 ⊆ S. If (x', y') ∈ D1, then there
exists s ∈ S1 ⊆ S that satisfies s(x') ≠ s(y'). If (x', y') ∈ D2, then there
exists s ∈ S2 ⊆ S that satisfies s(x') ≠ s(y'). Clearly, for any S ∈ 𝒮 and
(x', y') ∈ D1 ∪ D2, there exists s ∈ S that satisfies s(x') ≠ s(y'), that is,
S is a distinction of 𝒟.

Assume that S is not a reduct of 𝒟. Then there exists S0 ⊂ S that is a
distinction of 𝒟 (Definition 3.5), and hence there exists SS ⊆ S0 that is a
reduct of 𝒟 (Theorem 3.1). According to Property 3.2, SS is also a distinction
of 𝒟1 and 𝒟2. So, according to Theorem 3.1, there exist SS1 ∈ 𝒮1 and SS2 ∈ 𝒮2
that satisfy SS1 ⊆ SS and SS2 ⊆ SS. Therefore (SS1 ∪ SS2) ⊆ SS ⊂ S and
(SS1 ∪ SS2) ∈ 𝒮'. According to Definition 3.6, we have S ∉ 𝒮, which is a
contradiction. So S is a reduct of 𝒟.

2. Then we prove that any reduct of 𝒟 belongs to 𝒮.

The proof is similar to the one above, so it is omitted.



4 Discernibility Invariant 

Given an information system, we consider two kinds of NOP (non-ordered
pair): one kind includes the discernible pairs; the other kind includes the pairs
expected to be discernible. For example, in classic rough sets, if we are given a
decision table $\mathcal{A}=(U, A \cup \{d\})$, then the first kind of pair refers to $(x, y) \in ANO(U)$
that can be discerned by the set $A$ of conditional attributes, that is, $\exists a \in A\,(a(x) \neq a(y))$.
On the other hand, if $(x, y) \in ANO(U)$ and $d(x) \neq d(y)$, then $(x, y)$ is
expected to be discernible, that is, we expect to discern between two elements with
different decision values. If we are given an information system $\mathcal{A}=(U, A)$, then
any $(x, y) \in ANO(U)$ that satisfies $\exists a \in A\,(a(x) \neq a(y))$ is considered to be discerned by $A$,
while every $(x, y) \in ANO(U)$ is considered to be expected to be discernible.

So we can state that reducing attributes in an information system means searching for a minimal
set of attributes that preserves an invariant: the discernibility of all the NOPs that are
expected to be discernible and can be discerned.

Based on the above idea, we have the following:



Theorem 4.1. Given an information system $\mathcal{A}=(X, A)$, then $ADNO(X, A) = ANO(X) - IND(A)$.

Proof: omitted.

Theorem 4.2. Given an information system $\mathcal{A}=(X, A)$, $B \subseteq A$ is a reduct of $\mathcal{A}$ if and
only if $B$ is a reduct of discernibility system $\mathcal{D}=(X, A, ADNO(X, A))$.

Proof:

1. First, we prove that if $B$ is a reduct of $\mathcal{A}$ then $B$ is a reduct of $\mathcal{D}$.
For any $(x, y) \in ADNO(X, A)$, we have $(x, y) \notin IND(A)$. If $B$ is a reduct of $\mathcal{A}$, then
there exists $b \in B$ that satisfies $b(x) \neq b(y)$. So $B$ is a distinction of $\mathcal{D}$.

For any $C \subset B$, because $B$ is a reduct of $\mathcal{A}$, there exist $x, y \in X$ such that $(x, y) \notin IND(A)$
and $(x, y) \in IND(C)$, that is, $(x, y) \in ADNO(X, A)$ and $(x, y) \notin ADNO(X, C)$.
(Note: we have $IND(A) \subseteq IND(C)$ because $A \supseteq C$.) So $C$ is not a
distinction of $\mathcal{D}$.

From the above, if $B$ is a reduct of $\mathcal{A}$ then $B$ is a reduct of $\mathcal{D}$.

2. Similarly, we can prove that if $B$ is a reduct of $\mathcal{D}$ then $B$ is a reduct of $\mathcal{A}$.

Theorem 4.3. Given a decision table $\mathcal{A}=(X, A \cup \{d\})$, $B \subseteq A$ is a relative reduct of $\mathcal{A}$
if and only if $B$ is a reduct of discernibility system $\mathcal{D}=(X, A, RD(X, A))$, where
$RD(X, A) = \{(x, y) \in ADNO(X, A) \mid d(x) \neq d(y)\} = \{(x, y) \in ANO(X) \mid d(x) \neq d(y)$
and $\exists b \in A\,(b(x) \neq b(y))\}$.

Proof:

1. First we prove that if $B$ is a relative reduct of $\mathcal{A}$ then $B$ is a reduct of
discernibility system $\mathcal{D}=(X, A, RD(X, A))$.

For any $(x, y) \in RD(X, A)$, we have $d(x) \neq d(y)$ and $(x, y) \notin IND(A)$. Because $B$ is
a relative reduct of $\mathcal{A}$, we have $POS(B, \{d\})=POS(A, \{d\})$. Decision table $\mathcal{A}$
being consistent, we have $x, y \in POS(A, \{d\})$ and $x, y \in POS(B, \{d\})$. Because
$d(x) \neq d(y) \Rightarrow (x, y) \notin IND(B)$ (otherwise we have $x, y \notin POS(B, \{d\})$, which is a
contradiction), there must exist $b \in B$ such that $b(x) \neq b(y)$. So $B$ is a distinction
of $\mathcal{D}$.

For any $C \subset B$, if $B$ is a relative reduct of $\mathcal{A}$, then $POS(A, \{d\}) \supset POS(C, \{d\})$,
that is, there exists $x \in X$ such that $x \in POS(A, \{d\})$ and $x \notin POS(C, \{d\})$. So there
exists $y \in POS(A, \{d\})$ such that $d(x) \neq d(y)$, $(x, y) \notin IND(A)$ and $(x, y) \in IND(C)$. That
is, there does not exist $c \in C$ such that $c(x) \neq c(y)$. So $C$ is not a distinction of $\mathcal{D}$.
According to Definition 3.5, $B$ is a reduct of $\mathcal{D}$.

2. Analogously, we can also prove that if $B$ is a reduct of $\mathcal{D}=(X, A, RD(X, A))$, then $B$
is a relative reduct of $\mathcal{A}$.

Based on the theorems above, we can easily transform the problems of computing the
reduct set of an information system and computing the relative reduct set of a decision table
into the problem of computing the reduct set of a specific discernibility system.



5 An Incremental Algorithm for Computing the Relative Reduct
Set of Decision Table



Theorem 5.1. Given a decision table $\mathcal{A}=(X, A \cup \{d\})$ and $(x, x') \in RD(X, A)$, if
$\mathbf{b}=\{b \mid b \in A, b(x) \neq b(x')\}$, then $\{\{b\} \mid b \in \mathbf{b}\}$ is the reduct set of $\mathcal{D}=(X, A, \{(x, x')\})$.

Proof: omitted.

In a decision table $\mathcal{A}=(X, A \cup \{d\})$, let $\mathcal{S}$ be the relative reduct set, that is, $\mathcal{S}$ is the reduct
set of discernibility system $\mathcal{D}=(X, A, RD(X, A))$. If a new object $x$ is added to $X$, then
we have $RD_x(X)=\{(x, x') \mid x' \in X, \exists a \in A\,(a(x) \neq a(x'))$ and $d(x) \neq d(x')\}$. Because
$RD(X \cup \{x\}, A)= RD(X, A) \cup RD_x(X)$, the problem of computing the relative reduct set
of decision table $\mathcal{A}'=(X \cup \{x\}, A \cup \{d\})$ can be transformed into the problem of computing the
reduct set of discernibility system $\mathcal{D}'=(X \cup \{x\}, A, RD(X \cup \{x\}, A))$.
Now, we present an incremental algorithm, ASRAI, as follows:

Input:  the relative reduct set S of a decision table A = (X, A ∪ {d});
        an object x that is newly added;
Output: the relative reduct set S' of the decision table A' = (X ∪ {x}, A ∪ {d});

Begin
    S' := S;
    For any x' ∈ X do
    Begin
        If not (x IND(A) x') and (d(x) ≠ d(x')) then
        Begin
            $\mathbf{b}$ := {b | b ∈ A, b(x) ≠ b(x')};
            S' := minimize({B ∪ {b} | B ∈ S', b ∈ $\mathbf{b}$});
        End;
    End;
End.

The correctness of this algorithm follows easily from Theorem 3.2,
Theorem 4.3 and Theorem 5.1. The minimize() in this algorithm is a function that
computes the minimum of a family of sets (see Definition 3.6). In a practical
implementation, we can use the fact that {b} is a one-element set in order to shorten the
running time of this algorithm.
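The ASRAI pseudocode can be sketched in Python as below. This is a hedged illustration rather than the authors' implementation: the function names, the dictionary-based table layout and the argument order are assumptions made for this example, and `minimize` is the straightforward form of Definition 3.6 without the one-element shortcut mentioned above.

```python
def minimize(family):
    """Keep only the sets that have no proper subset in the family (Definition 3.6)."""
    sets = {frozenset(b) for b in family}
    return {b for b in sets if not any(other < b for other in sets)}


def asrai(reducts, table, decision, new_row, new_decision):
    """Update the relative reduct set after one object is added.

    reducts      -- current relative reduct set (iterable of attribute sets)
    table        -- dict: object id -> dict of condition attribute values
    decision     -- dict: object id -> decision value
    new_row      -- condition attribute values of the new object
    new_decision -- decision value of the new object
    """
    current = {frozenset(r) for r in reducts}
    for old_id, old_row in table.items():
        # Attributes on which the new object differs from the old one.
        differing = {a for a in new_row if new_row[a] != old_row[a]}
        if differing and decision[old_id] != new_decision:
            current = minimize({r | {a} for r in current for a in differing})
    return current


# Hypothetical two-object table: the existing object and the new one differ on "E" only.
table = {1: {"R": "Dark", "E": "Blue"}}
decision = {1: 1}
print(asrai([set()], table, decision, {"R": "Dark", "E": "Brown"}, 2))
```

Starting from the reduct set `{∅}` of a table with nothing to discern, adding one object with a different decision yields the singleton reducts of Theorem 5.1.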



6 Example 

Example: To illustrate the idea of algorithm ASRAI, we consider the following 
decision table A=(U, Au{d}), which is extracted from [3]. Assume object 6 is the 
object newly added. 





U   H      W      R      E      d
1   Short  Light  Dark   Blue   1
2   Tall   Heavy  Dark   Blue   1
3   Tall   Heavy  Dark   Brown  1
4   Tall   Heavy  Red    Blue   2
5   Short  Light  Blond  Blue   2
6   Tall   Heavy  Blond  Brown  1



For $X=\{1, 2, 3, 4, 5\}$, $RD(X, A) = \{(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5)\}$, and the
relative reduct set of decision table $(X, A \cup \{d\})$ is $\{\{R\}\}$, which is also the reduct set of
discernibility system $(X, A, RD(X, A))$.

If object $x=6$ is newly added, then $RD_x(X)=\{(4, 6), (5, 6)\}$.
First we compute the reduct set of discernibility system
$(X \cup \{6\}, A, RD(X, A) \cup \{(4, 6)\})$, which is equal to minimize$(\{\{R\}, \{R, E\}\}) = \{\{R\}\}$. Then we
compute the reduct set of discernibility system
$(X \cup \{6\}, A, RD(X, A) \cup \{(4, 6)\} \cup \{(5, 6)\}) = (X \cup \{6\}, A, RD(X, A) \cup RD_x(X)) = (X \cup \{6\}, A, RD(X \cup \{6\}, A))$,
which is equal to minimize$(\{\{R, H\}, \{R, W\}, \{R, E\}\}) = \{\{R, H\}, \{R, W\}, \{R, E\}\}$.

Thus we have obtained the relative reduct set of decision table
$\mathcal{A}=(U, A \cup \{d\})=(X \cup \{6\}, A \cup \{d\})$.
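The result of this example can be cross-checked by brute force: enumerate every subset of {H, W, R, E} and keep the minimal ones that discern all pairs in RD. The sketch below does this for the six-object table; the helper names are hypothetical, and exhaustive enumeration is used only because the example has four attributes.

```python
from itertools import combinations

ATTRS = ("H", "W", "R", "E")
ROWS = {  # object -> (H, W, R, E, d), as in the example table
    1: ("Short", "Light", "Dark", "Blue", 1),
    2: ("Tall", "Heavy", "Dark", "Blue", 1),
    3: ("Tall", "Heavy", "Dark", "Brown", 1),
    4: ("Tall", "Heavy", "Red", "Blue", 2),
    5: ("Short", "Light", "Blond", "Blue", 2),
    6: ("Tall", "Heavy", "Blond", "Brown", 1),
}


def relative_discernible_pairs(rows):
    """RD: pairs with different decisions that differ on some condition attribute."""
    pairs = []
    for x, y in combinations(sorted(rows), 2):
        *cx, dx = rows[x]
        *cy, dy = rows[y]
        if dx != dy and cx != cy:
            pairs.append((x, y))
    return pairs


def all_reducts(rows):
    """Enumerate attribute subsets by size; keep distinctions with no smaller reduct inside."""
    pairs = relative_discernible_pairs(rows)
    found = []
    for size in range(len(ATTRS) + 1):
        for idx in combinations(range(len(ATTRS)), size):
            separates = all(
                any(rows[x][i] != rows[y][i] for i in idx) for x, y in pairs
            )
            if separates and not any(set(f) <= set(idx) for f in found):
                found.append(idx)
    return {frozenset(ATTRS[i] for i in idx) for idx in found}


first_five = {k: v for k, v in ROWS.items() if k <= 5}
print(all_reducts(first_five))
print(all_reducts(ROWS))
```

The first call corresponds to the table before object 6 is added, the second to the table after it is added.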



7 Conclusion 

In this paper, we first present the definitions of a discernibility system (Definition 3.3)
and of a reduct in it (Definition 3.5), which can be used as a formal framework for attribute
reduction based on rough sets.

Then we develop an incremental algorithm for computing the relative reduct set of a
decision table. The algorithm has been implemented by the authors and runs correctly.
However, practical databases always contain a lot of noise, so an approximate invariant
should be defined to make our method robust to noise; this is the subject of our future
work.
Abstract. Self-organizing maps are an unsupervised neural network
model which lends itself to the cluster analysis of high-dimensional input 
data. However, interpreting a trained map proves to be difficult because 
the features responsible for a specific cluster assignment are not evi- 
dent from the resulting map representation. In this paper we present our 
LabelSOM approach for automatically labeling a trained self-organizing 
map with the features of the input data that are the most relevant ones 
for the assignment of a set of input data to a particular cluster. The 
resulting labeled map allows the user to better understand the structure 
and the information available in the map and the reason for a specific
map organization, especially when only little prior information on the 
data set and its characteristics is available. 



1 Introduction 

The self-organizing map (SOM) [2,3] is a prominent unsupervised neural network 
model for cluster analysis. Data from a high-dimensional input space is mapped 
onto a usually two-dimensional output space with the structure of the input data 
being preserved as faithfully as possible. This characteristic is the reason why
the SOM has attracted wide use across a range of application
areas. However, its use in the knowledge discovery domain has been limited
due to some drawbacks in the interpretability of a trained SOM. As one of the 
shortcomings we have to mention the difficulties in detecting the cluster bound- 
aries within the map. This problem has been addressed intensively and led to a 
number of enhanced visualization techniques allowing an intuitive interpretation 
of the self-organizing map, e.g. the U-Matrix [10], Adaptive Coordinates [6] and 
Cluster Connection techniques [5]. 

However, it still remains a challenging task to label the map, i.e. to determine 
the features that are characteristic for a particular cluster. Given an unknown 
data set that is mapped onto a self-organizing map, even with the visualization 
of clear cluster boundaries it remains a non-trivial task to elicit the features that 
are the most relevant and determining ones for a group of input data to form 
a cluster of its own, which features they share and which features distinguish 
them from other clusters. We are looking for a method that allows the automatic 
assignment of labels describing every node in the map. A method addressing this 
problem using component planes for visualizing the contribution of each variable
in the organization of a map has been presented recently [1]. This method,
however, requires heavy manual interaction by examining each dimension separately
and thus does not lend itself to automatic labeling of SOMs trained with
high-dimensional input data.

In this paper we present our novel LabelSOM approach to the automatic 
labeling of trained self-organizing maps. In a nutshell, every unit of the map is 
labeled with the features that best characterize all the data points which are 
mapped onto that particular node. This is achieved by using a combination of 
the mean quantization error of every feature and the relative importance of that 
feature in the weight vector of the node. 

We demonstrate the benefits of this approach by labeling a SOM that was 
trained with a widely used reference data set describing animals by various 
attributes. The resulting labeled SOM gives a description of the animals mapped 
onto nodes and characterizes the various (sub)clusters present in the data set. 
We further provide a real-world example in the field of text mining based on 
a digital library SOM trained with the abstracts of scientific publications. This 
SOM represents a map of the scientific publications with the labels serving as a 
description of their topics and thus the various research fields. 

The remainder of the paper is organized as follows. In Section 2 we present a 
brief review of the self-organizing map, its architecture and training process as 
well as the LabelSOM method to assign a set of labels for every node in a trained 
SOM and provide results from a well known reference data set. In Section 3 we 
provide the results of applying the LabelSOM method to a real world data set 
labeling a map representing the abstracts of scientific publications. We further 
demonstrate how additional information on the cluster structure can be derived 
from the information provided by the labeling. A discussion of the presented 
LabelSOM method as well as its importance for the area of knowledge discovery 
and data mining is provided in Section 4. Finally, our conclusions are presented 
in Section 5. 



2 SOM and LabelSOM 

The self-organizing map is an unsupervised neural network providing a mapping 
from a high-dimensional input space to a usually two-dimensional output space 
while preserving topological relations as faithfully as possible. The SOM consists
of a set of nodes $i$ arranged in a two-dimensional grid, with a weight vector $m_i \in \mathbb{R}^n$
attached to each node. Elements from the high-dimensional input space, referred
to as input vectors $x \in \mathbb{R}^n$, are presented to the SOM and the activation of each
node for the presented input vector is calculated using an activation function. 
Commonly, the Euclidean distance between the weight vector of the node and the 
input vector serves as the activation function. In the next step, the weight vector 
of the node showing the highest activation (i.e. the smallest Euclidean distance) 
is selected as the ‘winner’ $c$ and is modified so as to more closely resemble the
presented input vector. Pragmatically speaking, the weight vector of the winner
is moved towards the presented input signal by a certain fraction of the Euclidean
distance as indicated by a time-decreasing learning rate $\alpha$. Thus this node’s
activation will be even higher the next time the same input signal is presented.
Furthermore, the weight vectors of nodes in the neighborhood of the winner are
modified accordingly, though to a lesser extent than that of the winner.
This learning procedure finally leads to a topologically ordered mapping of the 
presented input signals, i.e. similar input signals are mapped onto neighboring 
regions of the map. 
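A single training step as described above might look like the following pure-Python sketch. The Gaussian neighborhood function, the grid-distance cutoff `radius` and all function names are assumptions chosen for illustration; the text only states that neighbors are adapted less strongly than the winner.

```python
import math


def winner(weights, x):
    """Grid position of the node whose weight vector is closest to x (Euclidean distance)."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def train_step(weights, x, alpha, radius):
    """Move the winner and its grid neighbors towards x; return the winner's position.

    weights -- dict: grid position (row, col) -> weight vector (list of floats)
    alpha   -- learning rate for this step (decreased over time by the caller)
    radius  -- neighborhood radius on the grid (assumed Gaussian falloff inside it)
    """
    c = winner(weights, x)
    for pos, m in weights.items():
        grid_dist = math.dist(pos, c)
        if grid_dist <= radius:
            h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2)) if radius > 0 else 1.0
            weights[pos] = [mk + alpha * h * (xk - mk) for mk, xk in zip(m, x)]
    return c


# Hypothetical 1 x 2 map with two-dimensional weight vectors.
som = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
print(train_step(som, [0.9, 1.0], alpha=0.5, radius=0))
print(som)
```

With `radius=0` only the winner is updated; a full training run would repeat this step over all inputs while shrinking both `alpha` and `radius`.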

Still, the characteristics of each node are not detectable from the map rep- 
resentation itself. With no a priori knowledge on the data, even providing infor- 
mation on the cluster boundaries does not reveal information on the relevance 
of single attributes for the clustering and classification process. In the LabelSOM 
approach we determine those vector elements (i.e. features of the input space) 
that are most relevant for the mapping of an input vector onto a specific node. 
This is basically done by determining the contribution of every element in the 
vector towards the overall Euclidean distance between an input vector and the 
winner’s weight vector. The LabelSOM method is built upon the observation that
after SOM training the weight vector elements resemble as far as possible the 
corresponding input vector elements of all input signals that are mapped onto 
this particular node as well as to some extent those of the input signals mapped 
onto neighboring nodes. Vector elements having about the same value within the 
set of input vectors mapped onto a certain node describe the node in so far as 
they denominate a common feature of all data signals of this node. If a majority 
of input signals mapped onto a particular node exhibit a highly similar input 
vector value for a particular feature, the corresponding weight vector value will 
be highly similar as well. Thus the quantization error for all individual features 
serves as a guide for their relevance as a class label. 

However, in real world text mining application scenarios we are usually faced 
with a high number of attributes which are not existent and thus have a value 
of 0 for a certain class of input signals. These attributes frequently yield a quan- 
tization error of almost 0 for certain nodes but are nevertheless not suitable for 
labeling the node. The reason is that we want to describe the present features
that are responsible for a certain clustering rather than describe a cluster via 
the features that are not present in the data forming that cluster. 

Hence, we need to determine those vector elements from each weight vector 
which, on the one hand, exhibit about the same value for all input signals mapped 
onto that specific node as well as, on the other hand, have a high overall value 
indicating its importance. Formally this is done as follows: Let $C_i$ be the set of
input patterns $x_j$ mapped onto node $i$. Summing up the distances for each vector
element $k$ over all the vectors $x_j$ ($x_j \in C_i$) yields a quantization error vector $q_i$ for
every node (Equation 1).

$$q_{i_k} = \sum_{x_j \in C_i} \sqrt{(m_{i_k} - x_{j_k})^2}, \qquad k = 1..n \tag{1}$$
To create a set of $p$ labels for every node $i$ we take the $p$ smallest vector
elements from the quantization error vector $q_i$. In order to get rid of labels
describing non-existent features we define a threshold $\tau$ and take only features
having corresponding weight vector entries above $\tau$. The threshold is typically
set to very small values slightly above 0.
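The label selection just described can be sketched as follows. The per-element distance is taken to be the absolute difference between weight and input values, and features are filtered by the threshold before the p best are chosen; both points are this sketch's reading of the text, and the function and variable names are hypothetical.

```python
def quantization_error_vector(weight, mapped_inputs):
    """Per-feature sum of absolute differences between the node's weight and its inputs."""
    return [sum(abs(mk - x[k]) for x in mapped_inputs) for k, mk in enumerate(weight)]


def label_node(weight, mapped_inputs, feature_names, p, tau):
    """Return up to p feature names with the smallest error among weights above tau."""
    q = quantization_error_vector(weight, mapped_inputs)
    candidates = [k for k, mk in enumerate(weight) if mk > tau]
    candidates.sort(key=lambda k: q[k])
    return [feature_names[k] for k in candidates[:p]]


# Hypothetical node with three features and two mapped inputs.
weight = [0.5, 0.0, 0.8]
inputs = [[0.5, 0.0, 0.6], [0.5, 0.0, 1.0]]
names = ["small", "feathers", "hunt"]
print(label_node(weight, inputs, names, p=2, tau=0.01))
```

Here the second feature has zero quantization error but is excluded because its weight is not above the threshold, which is the situation the threshold is meant to handle.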



Attribute||Dove|Hen|Duck|Goose|Owl|Hawk|Eagle|Fox|Dog|Wolf|Cat|Tiger|Lion|Horse|Zebra|Cow|
Table 1. Input Data Set: Animals



Figure 1 shows the result of labeling a 6 x 6 SOM trained with the animals
data set [8] given in Table 1. (Please note that the input vectors have been
normalized to unit length prior to the training process.) The standard representation
of the resulting SOM is given in Figure 1a, where each node is assigned
the names of the input vectors that were mapped onto it. In the resulting map
we find a clear separation of the birds in the upper part of the map from the
mammals in the lower area. However, this conclusion can only be drawn if one
has prior knowledge about the data and can thus infer the characteristic features
of sets of input data from their names.

In Figure 1b, each node is assigned a set of 5 labels based on the quantization
error vector and the value of the vector element. We find that each animal is
labeled with its characteristic attributes. In addition to the information to be
drawn from the characterization of the data points mapped onto the nodes,
further information on the cluster boundaries can be derived, such as the fact
that the birds are distinguished from the mammals in that they all have 2
legs and feathers instead of 4 legs and hair. Further subclusters are identified by
the size of the animals and their preferences for hunting, flying, swimming,
etc. For example, the big mammals are located in the lower right corner of
the map as a subcluster of the mammals. As another subcluster consider the
distinction of hunting vs. non-hunting animals, irrespective of their belonging
to the group of birds or group of mammals. The hunting animals may be found
on the left side of the map whereas the non-hunting animals are located on
the right side. Thus we can not only identify the decisive attributes for the
assignment of every input signal to a specific node but also detect the cluster
boundaries between nodes that are not winner for any specific input signal and
tell the characteristics and extents of subclusters within the map. Mind that not
all nodes have the full set of 5 labels assigned, i.e. one or more labels are empty
(none) like the node representing the dog in the lower right corner. This is due to
the fact that fewer than 5 vector elements have a weight vector value greater
than $\tau$.

Fig. 1. Labeling of a 6 x 6 SOM trained with the animals data set. a.) Standard
Representation of SOM b.) 5 labels assigned to each node of the SOM

3 A Labeled Library SOM 

For the following larger example we used 48 publications from the Depart- 
ment of Software Technology which are available via our department web server 
(www.ifs.tuwien.ac.at/ifs/). We used full-text indexing to represent the various 
documents. The indexing process identified 482 content terms, i.e. terms used 
for document representation. During indexing we omitted terms that appear in 
less than 10% or more than 90% of all documents and applied some basic stem- 
ming rules. The terms are weighted according to a simple tf x idf weighting 
scheme [9]. 
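The indexing step can be illustrated with a short sketch that filters terms by document frequency and then weights them. The exact tf x idf variant of [9] is not given in the text, so the common form tf · log(N/df) is assumed here; stemming is left out, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def index_documents(docs, min_df=0.1, max_df=0.9):
    """Build tf x idf vectors from tokenized documents.

    Terms occurring in fewer than min_df or more than max_df of all documents
    are dropped, mirroring the 10% / 90% cutoffs described above.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if min_df <= c / n <= max_df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors


# Hypothetical tokenized abstracts.
docs = [
    ["som", "map", "label"],
    ["som", "map"],
    ["som", "text"],
    ["som", "label", "label"],
]
vocab, vectors = index_documents(docs)
print(vocab)
print(vectors[3])
```

The term "som" appears in every document and is therefore removed; the remaining terms form the content terms used as vector dimensions.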

In the area of digital libraries we are faced with a tremendous amount of 
’noise’ in the input data resulting from the indexing of free-form documents. 



Automatic Labeling of Self-Organizing Maps 233 



In the example presented below, another problem originates in the fact that
abstracts contain little to no redundancy, both in terms of the information presented
and in the choice of words. Due to their limited length
and condensed structure, word repetition and clarification of the most important
aspects within the text are usually not present, resulting in less specific
vector representations of the documents. Thus using only the abstracts provides
a somewhat more challenging task than using the complete documents. For a
deeper discussion on the utilization of SOMs in text data mining we refer to [7,4].




Fig. 2. Standard Representation of a 7 x 7 SOM trained with paper abstracts 



Figure 2 depicts a 7 x 7 SOM trained with the 48 abstracts. The various 
nodes again list the input vectors, i.e. the abstracts, that were mapped onto the 
nodes. The naming convention for the abstracts is such as to give the name of 
the first author of a paper as the first three characters, followed by the short 
form label of the respective conference. 

Without any additional knowledge on the underlying documents, the result- 
ing mapping of the SOM given in Figure 2 is hard to interpret, although the 
names of the authors may give some hints towards the cluster structure of the 
map (at least if you know the authors and have some knowledge concerning 
their research areas). However, no information on the contents of the papers, 
i.e. the keywords, can be drawn from the resulting map. With a weight vector 
dimensionality of 482, manual inspection of the importance of the single vector 
elements simply is not feasible. 
Fig. 3. Labeling of a 7 x 7 SOM trained with paper abstracts



Having a set of 10 labels automatically assigned to the individual nodes in
Figure 3 leaves us with a somewhat clearer picture of the underlying text archive
and allows us to understand the reasons for a certain cluster assignment as well
as identify overlapping topics and areas of interest within the document collection.
For example, in the upper left corner we find a group of nodes sharing labels
like skeletal plans, clinical, guideline, patient, health which deal with the development
and representation of skeletal plans for medical applications. Another
homogeneous cluster can be found in the upper right corner which is identified
by labels like gait, pattern, malfunction and deals with the analysis of human
gait patterns to identify malfunctions and support diagnosis and therapy. A
set of nodes in the lower left corner of the map is identified by a group of labels
containing among others software, process, reuse and identifies a group of papers
dealing with software process models and software reuse. This is followed
by a large cluster to the right labeled with cluster, intuitive, document, archive,
text, input containing papers on cluster visualization and its application in the
context of document archives. Further clusters can be identified in the center of
the map on plan validation, quality analysis, neural networks, etc.

To present a more detailed example, the node representing the abstract
koh_icann98 (node (7/2) in the upper right part of the map) is labeled with
the following keywords: gait, pattern, platform, aiming, bio-feedback, effects, diseases,
active, compensation, employ. The full text of the abstract is given in
Figure 4. The labels derived from the LabelSOM approach evidently
describe the contents of the paper to a sufficient degree.



Experiments in Gait Pattern Classification with Neural Networks of Adaptive Architecture 

Monika Kohle, Dieter Merkl 

Abstract: 

Clinical gait analysis is an area aiming at the provision of support for diagnoses and therapy 
considerations, the development of bio-feedback systems, and the recognition of effects of multiple 
diseases and still active compensation patterns during the healing process. The data recorded 
with ground reaction force measurement platforms is a convenient starting point for gait analysis. 
We discuss the usage of raw data from such measurement platforms for gait analysis and show 
how unsupervised artificial neural networks may be employed for gait malfunction identification. 
In this paper we provide our latest results in this line of research by using Incremental Grid 
Growing and Growing Grid networks for gait pattern classification. 

Proceedings of the 8th Int’l Conference on Artificial Neural Networks (ICANN’98),
Skövde, Sweden, Sept 2-4, 1998.

Fig. 4. Abstract of koh_icann98 

Using these labels as class identifiers, clear cluster boundaries can be defined
by combining nodes sharing a set of labels. This results in a total of 8 different
clusters as shown in Figure 5.



4 Discussion 

While the labels identified by our LabelSOM method in the text data mining 
example can probably not serve directly as a kind of class labeling in the con- 
ventional sense, they reveal a wealth of information about the underlying map 
and the structures learned during the self-organizing training process. Groups 
of nodes having a set of labels in common help to determine the cluster structure
within the map. This can be used to provide an improved way of cluster
boundary and map structure analysis and visualization by grouping and coloring
nodes according to the common set of labels. The benefit of determining cluster
boundaries with the LabelSOM method lies in the fact that in addition to the
mere cluster structure, the user gets a justification for the clustering as well as
information on the sub-structure within clusters from the very attributes.

Fig. 5. Cluster identification based on labels

The labels themselves aid in identifying the most important features within 
every node and thus help to understand the information represented by a par- 
ticular node. In spite of the little redundancy present in abstracts, the labels 
turn out to be informative in so far as they help the user to understand the map 
and the data set as such. Especially in cases where little to no knowledge on the 
data set itself is available, the resulting representation can lead to tremendous 
benefits in understanding the characteristics of the data set. 

It is important to mention that the information used for the labeling orig- 
inates entirely from the self-organizing process of the SOM without the use of 
sophisticated machine learning techniques which might provide further improved 
labeling capabilities. Still, with the increasing use of self-organizing maps in the 
data mining area, the automatic labeling of resulting maps to identify the fea- 
tures of certain clusters based on the training process itself becomes an important 
aid in correctly applying the process and interpreting the results. Being based 
on a neural network approach with high noise tolerance allows the application 
of the LabelSOM approach in a wide range of domains, especially in the analysis 
of very high-dimensional input spaces. 
5 Conclusion

We have presented the LabelSOM method to automatically assign labels to the 
nodes of a trained self-organizing map. This is achieved by determining those 
features from the high-dimensional feature space that are the most relevant ones
for a given input to be assigned to a particular cluster. The resulting benefits
are twofold: First, assigning labels to each node helps with the interpretation
of single clusters by making the common features of a set of data signals that are 
mapped onto the same node explicit. This serves as a description for each set of 
data mapped onto a node. Second, by taking a look at groups of (neighboring) 
nodes sharing common labels it is possible to determine sets of nodes forming 
larger clusters, to identify cluster and sub-cluster boundaries and to provide spe- 
cific information on the differences between clusters. Finally, labeling the map 
allows it to be actually read. 
Abstract. When neural networks are trained on a large amount of data for classification, there
generally exists a learning complexity problem. In this paper, a new geometrical interpretation
of the McCulloch-Pitts (M-P) neural model is presented. Based on the interpretation, a new
constructive learning approach is discussed. Experimental results show that the new algorithm 
can greatly reduce the learning complexity and can be applied to real classification problems 
with a vast amount of data. 

1 Introduction 

A basic M-P neural model [1] is an element with n inputs and one output. The general 
form of its function is 

$$y = \mathrm{sgn}(wx - \varphi)$$

where

$x = (x_1, x_2, \ldots, x_n)^T$ is the input vector,

$w = (w_1, w_2, \ldots, w_n)$ is the weight vector, and

$\varphi$ is the threshold.

The node function above can be regarded as a composition of two functions: a linear
function $wx - \varphi$ and a sign (or characteristic) function. Generally, there are two
ways to reduce the learning complexity in well-known feedforward neural networks.
One is to replace the linear function by a quadratic function [2]-[9]. Although the
learning capacity of a neural network can be improved by making the node functions
more complex, the improvement in learning complexity is limited by the
complexity of quadratic functions. The second way to enhance the learning capacity is
to change the topological structure of the network, for example by increasing the number of hidden
layers and the number of connections between layers [10]-[12]. Similarly,
the learning capacity is improved at the price of increasing the complexity of the
network.

The proposed algorithm below can reduce the learning complexity and still maintain 
the simplicity of the M-P model and its corresponding network. 

Note that $wx - \varphi = 0$ can be interpreted as a hyper-plane $P$ in an n-dimensional space.
When $(wx - \varphi) > 0$, input vector $x$ falls into the positive half-space of the hyper-plane
$P$, and $y = \mathrm{sgn}(wx - \varphi) = 1$. When $(wx - \varphi) < 0$, input vector $x$
falls into the negative half-space of $P$, and $y = -1$. In summary, the function of an M-P
neuron can geometrically be regarded as a spatial discriminator of an n-dimensional
space divided by the hyper-plane $P$.
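The half-space view of an M-P neuron can be made concrete with a few lines of Python. The function name is hypothetical, and returning -1 for points lying exactly on the hyper-plane is an assumption of this sketch, since the text only defines the two open half-spaces.

```python
def mp_neuron(w, phi, x):
    """Output +1 if x lies in the positive half-space of w.x - phi = 0, otherwise -1."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) - phi
    return 1 if activation > 0 else -1  # points on the plane are mapped to -1 (assumption)


# Hyper-plane x1 + x2 = 1 in the plane.
print(mp_neuron([1.0, 1.0], 1.0, [1.0, 1.0]))
print(mp_neuron([1.0, 1.0], 1.0, [0.0, 0.0]))
```

The first point lies on the positive side of the line and the second on the negative side.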

Define a transformation $T: D \to S^n$, $x \in D$, such that

$$T(x) = \left(x, \sqrt{d^2 - |x|^2}\right)$$

where $d \geq \max\{|x| : x \in D\}$.

Thus, each point of $D$ is projected upward onto $S^n$ by transformation $T$. Obviously,
each neuron represents a hyper-plane and divides the space into two halves. The
intersection between the positive half-space and $S^n$ is called a “sphere neighborhood”
(Fig. 1). When input $x$ falls into the region, output $y=1$; otherwise, $y=-1$. Therefore,
each neuron corresponds to a sphere neighborhood on $S^n$ with $w$ as its center and
$r(\varphi)$ as its radius, a monotonically decreasing function of $\varphi$.
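To illustrate the projection onto the sphere, the sketch below appends the extra coordinate that places a point at distance d from the origin, and measures a sphere neighborhood by its angular radius. Using the angle arccos(φ/d²) as the radius is one possible choice made here for a center w that itself lies on the sphere; the text does not specify how r(φ) is measured, and the function names are hypothetical.

```python
import math


def lift_to_sphere(x, d):
    """Append sqrt(d^2 - |x|^2) so that the lifted point has Euclidean norm d."""
    squared_norm = sum(v * v for v in x)
    if squared_norm > d * d:
        raise ValueError("d must be at least the norm of x")
    return list(x) + [math.sqrt(d * d - squared_norm)]


def cap_angle(phi, d):
    """Angular radius of the region {x on the sphere : w.x > phi} when |w| = d."""
    return math.acos(max(-1.0, min(1.0, phi / (d * d))))


point = lift_to_sphere([3.0, 0.0], 5.0)
print(point, math.hypot(*point))
print(cap_angle(10.0, 5.0), cap_angle(20.0, 5.0))
```

For two points on a sphere of radius d, w·x = d² cos θ, so the condition w·x > φ holds exactly when the angle θ between them is below arccos(φ/d²); this angle shrinks as φ grows, in line with the monotonically decreasing radius stated above.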




Fig. 1 A sphere neighborhood 

2 A Constructive Algorithm of Neural Networks 

Based on the above geometrical interpretation of neurons, a framework of the proposed 
constructive algorithm of neural classifiers can be stated as follows. 

A set $K = \{l^t = (x^t, y^t),\ t = 0, 1, 2, \ldots, p-1\}$ of training samples is given. Assume that
the output $y^t$ in $K$ has only $k$ different values, i.e., the training samples will be classified
into $k$ classes. There is no loss of generality in assuming that the first $k$ outputs have
mutually different values. Let $I(t)$ be the set of indices of samples with the same output
$y^t$, $t = 0, 1, \ldots, k-1$, i.e., $I(t) = \{i \mid y^i = y^t\}$, and let the input set corresponding to $I(t)$ be




240 Ling Zhang and Bo Zhang 



$p(t) = \{x^i \mid i \in I(t)\}, \quad t = 0, 1, \ldots, k-1.$

The design of a neural classifier can be divided into two stages: the input covering 
stage and the forward propagation algorithm stage. 

2.1 Input Covering Stage 

As a classifier (or an associative memory), the function of a neural network can be
stated as follows. Given a set $K = \{l^t = (x^t, y^t),\ t = 0, 1, \ldots, p-1\}$ of training
samples, after learning, the network should store the input-output pairs $(x^t, y^t)$. Moreover,
when the input becomes $x^t + \Delta^t$, the output should still remain $y^t$, where
$\Delta^t$ is regarded as noise or an error. For an M-P neuron $(w, \varphi)$, when $w = x^t$
and $r(\varphi) = \Delta^t$, as mentioned before, it geometrically corresponds to a sphere
neighborhood on $S^n$ with $x^t$ as its center and $\Delta^t$ as its radius. An M-P neuron
acts as a discriminator of an input class. Therefore, a network consisting of these
neurons has the same function as a classifier.

The first design stage of a neural classifier can be transformed into a covering problem for the
input vectors (a point set of an n-dimensional space). Namely, a set of sphere
neighborhoods is chosen to cover the inputs having the same output, i.e., belonging to
the same class, and different classes of inputs are covered by different sets of sphere
neighborhoods. Namely, an M-P neuron also corresponds to a covering on $S^n$.

It means that a set of sphere neighborhoods (coverings) $C_j(t)$
is chosen such that $C(t) = \bigcup_j C_j(t)$ covers all inputs $x^i \in p(t)$ and does not
cover any input $x^i \notin p(t)$, and the $C(t)$'s, $t = 0, 1, \ldots, k-1$, are mutually
disjoint.

2.2 Forward Propagation Algorithm 

Given a set of training samples $K = \{l^t = (x^t, y^t),\ t = 0, 1, \ldots, p-1\}$, by
a covering approach, the input vectors have been covered by a set of coverings based on
their classification. Assume that a set $C$ of $p-1$ coverings is obtained as follows.

$C = \{C(i),\ i = 1, 2, \ldots, p-1\}$

Assume that $x^i$ is the center of covering $C(i)$ and $y^i$ is its corresponding output,
$i = 1, 2, \ldots, p-1$. For simplicity, the $y^i$'s are assumed to be mutually different, i.e., one
class corresponds to only one covering (when one class corresponds to a set of
coverings, the constructive approach is almost the same). The domain of each
component of $x^i$ ($y^i$) is assumed to be $\{1, -1\}$. Then a three-layer neural network,
i.e., two layers of elements, can be constructed (Fig. 2) by the following Forward
Propagation (FP) algorithm.




Fig. 2. A three-layer feedforward neural network 



The network is constructed from its first layer. 

Let the relationship between the input vector $x$ and the output vector $z$ of the first layer
network $A$ be $F$. The number of elements in $A$ is $p-1$. Element $A_i$ in $A$ is an M-P
neuron with $n$ inputs and one output, $i = 1, 2, \ldots, p-1$; thus $z$ is a $(p-1)$-dimensional
vector. We have

$$z = F(x) = \mathrm{sgn}(Wx - \varphi) \tag{1}$$

where $W$ is a $(p-1) \times n$ weight matrix and $\varphi$ is a $(p-1)$-dimensional threshold vector, with

$$W = \begin{pmatrix} w^1 \\ w^2 \\ \vdots \\ w^{p-1} \end{pmatrix}, \qquad \varphi = \begin{pmatrix} \varphi_1 \\ \varphi_2 \\ \vdots \\ \varphi_{p-1} \end{pmatrix}$$

where $w^i$ is the weight vector of element $A_i$ and $\varphi_i$ is its threshold. Let

$$w^i = (x^i)^T, \qquad \varphi_i = \begin{cases} n - d_i + 1, & \text{if } d_i \text{ is even} \\ n - d_i, & \text{if } d_i \text{ is odd} \end{cases} \qquad i = 1, 2, \ldots, p-1 \tag{2}$$

where $d_i = \min_{j \ne i} d(x^i, x^j)$, $i = 1, 2, \ldots, p-1$, and $d(x, y)$ denotes the
Hamming distance between $x$ and $y$.
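
The first-layer construction can be sketched in Python as follows. This is a minimal illustration under the reading that $w^i = (x^i)^T$ and that $\varphi_i$ is chosen from the parity of the nearest Hamming distance $d_i$; the function names and the four sample vectors are hypothetical, not taken from the paper.

```python
import numpy as np

def hamming(a, b):
    """Number of positions in which two +/-1 vectors differ."""
    return int(np.sum(a != b))

def build_first_layer(samples):
    """samples: array of shape (p, n) with rows x^0 .. x^{p-1} in {+1, -1}.

    Neurons are built for x^1 .. x^{p-1}; x^0 is left to the complement region.
    """
    samples = np.asarray(samples)
    p, n = samples.shape
    W = samples[1:].astype(float)                  # w^i = (x^i)^T
    phi = np.empty(p - 1)
    for i in range(1, p):
        d_i = min(hamming(samples[i], samples[j]) for j in range(p) if j != i)
        phi[i - 1] = n - d_i + 1 if d_i % 2 == 0 else n - d_i
    return W, phi

def first_layer(W, phi, x):
    """z = sgn(Wx - phi), with outputs in {+1, -1}."""
    return np.where(W @ x - phi > 0, 1, -1)

# Four hypothetical 6-dimensional training inputs x^0 .. x^3.
X = np.array([
    [ 1,  1,  1,  1,  1,  1],
    [-1, -1,  1,  1,  1,  1],
    [ 1,  1, -1, -1, -1,  1],
    [-1,  1, -1,  1, -1, -1],
])
W, phi = build_first_layer(X)
Z = np.array([first_layer(W, phi, x) for x in X])
```

For these samples the thresholds come out as 5, 3 and 3. Each stored center $x^i$ ($i \ge 1$) fires only its own neuron, and $x^0$ fires none. For $\pm 1$ vectors $w^i x = n - 2\,d(x^i, x)$ has the parity of $n$, while each threshold has the opposite parity, so the argument of sgn is never zero.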

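It is easy to prove that for the inputs $x^0, x^1, \ldots, x^{p-1}$, we have

$z^i = F(x^i), \quad i = 1, 2, \ldots, p-1$

where $z^i$ is the $(p-1)$-dimensional vector whose $i$-th component is $+1$ and whose other
components are $-1$.

Let

$C(0) = \{x \mid w^i x - \varphi_i < 0,\ i = 1, 2, \ldots, p-1\}$

Since $x^0 \notin C(i)$, $i = 1, 2, \ldots, p-1$, we have $x^0 \in C(0)$.

Obviously, for $x \in C(0)$, $F(x) = z^0 = F(x^0)$.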

We now design the second layer B. 

Let the relationship between the input $z$ and the output $y$ of the second layer network $B$ be
$G$. The number of elements in $B$ is $m$. Each element $B_i$ in $B$ is an M-P neuron with
$(p-1)$ inputs and one output, $i = 1, 2, \ldots, m$. We have

$$y = G(z) = \mathrm{sgn}(Uz - \zeta) \tag{3}$$

where $U$ is an $m \times (p-1)$ weight matrix and $\zeta$ is an $m$-dimensional threshold vector, with

$$U = \begin{pmatrix} u^1 \\ \vdots \\ u^m \end{pmatrix}$$

where $u^i$ is the weight vector of element $B_i$ and $\zeta_i$ is its threshold.



Let 



if = y! 

<u‘j=y-, otherwise 



(4) 



We can prove that when formula (4) is satisfied, for the $p$ given output vectors
$\{y^0, y^1, \ldots, y^{p-1}\}$ and $\{z^0, z^1, \ldots, z^{p-1}\}$, formula (3) holds, i.e.,
$y^i = G(z^i)$, $i = 0, 1, \ldots, p-1$.


In contrast to the well-known BP algorithm, the constructive algorithm above proceeds
from the first layer to the last one, so it is called forward propagation. When the
components of the input and output vectors are real-valued rather than in {-1, 1}, the FP
algorithm still applies after small changes.



3. Performances Analysis 

Conventional neural network approaches generally have the following shortcomings.
There is no general theorem about the convergence of a neural network, and the
learning complexity is very high. It is also difficult to choose a proper number of hidden
nodes. We now discuss the performance of the proposed algorithm.

Hidden Nodes 

Since the number of first-layer elements (hidden nodes) is determined by the number of
input coverings, if a (quasi-)minimum covering can be found, then we have a
(quasi-)minimal number of hidden nodes.

Learning Complexity 

The proposed algorithm is a constructive one, so it does not have any convergence 
problem. Its ‘learning’ time consists mainly of the time needed for implementing the 
input covering. 

Since the minimum coverings problems are known to be NP-hard, the key point in the 
design of the proposed neural network is to find a feasible covering approach such that 
a satisfactory result can generally be obtained. A new covering method called 
covering-deletion algorithm and its computer simulation results are given below to 
show the potential to use the proposed neural network in practice. 

3.1. Covering-Deletion Algorithm 

The input covering problem is actually a point set covering problem [15]. It is known
that the covering problem is generally NP-hard. Although there are many well-known
algorithms dealing with the problem, an efficient algorithm is still needed. We now present a
new covering approach called the covering-deletion algorithm. The
experimental results show that the algorithm is quite efficient.

Assume that $K = \{x_1, x_2, \ldots\}$ is an input set, a finite point set in an n-dimensional space,
and $K_1, K_2, \ldots, K_s$ are $s$ subsets of $K$.

Design a three-layer neural network such that for all $x \in K_i$, its output is always
$y_i$, where $y_i = (0, \ldots, 1, 0, \ldots, 0)$, i.e., the $i$-th component is 1 and the others are 0. For
simplicity, in the following discussion, let $s = 2$, $y_1 = 1$ and $y_2 = -1$.

The general idea of the covering-deletion algorithm is as follows. Find a sphere
neighborhood (or cover) $C_1$ which only covers points belonging to $K_1$. Then
delete the points covered by $C_1$. For all points left, find a neighborhood $C_2$
which only covers points belonging to $K_2$. Then delete the points covered by
$C_2$, and so on, until the points belonging to $K_s$ are covered. Then, going back to the first
class, the same procedure continues class by class until all points are deleted.

In order to have as few coverings as possible, each neighborhood $C_i$ should
cover as many points as possible. We alternately use "finding the center of cover
$C_i$" and "translating the center of the cover to a proper position to enlarge its covering
area" to solve the problem above. A more detailed procedure will be discussed in
another paper.
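
The class-by-class loop described above can be sketched as follows. This is only an illustration under simplifying assumptions: the points are taken to lie on a common sphere already, the center of each cover is simply the first remaining point of the current class (the center-translation step mentioned in the text is not implemented), and the names `covering_deletion` and `classify` are hypothetical.

```python
import numpy as np

def covering_deletion(points, labels):
    """Greedy covering-deletion on points that already lie on a common sphere.

    Returns a list of (center, phi, label); a cover contains x when center.x > phi.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = list(dict.fromkeys(labels.tolist()))   # classes in order of first appearance
    remaining = np.ones(len(points), dtype=bool)
    covers = []
    turn = 0
    while remaining.any():
        label = classes[turn % len(classes)]
        turn += 1
        own = remaining & (labels == label)
        other = remaining & (labels != label)
        if not own.any():
            continue
        center = points[np.flatnonzero(own)[0]]      # assumption: first remaining point as centre
        inner = points @ center
        # Largest inner product with a remaining point of another class;
        # -inf when none is left, so the cover takes every remaining point.
        phi = inner[other].max() if other.any() else -np.inf
        covered = own & (inner > phi)
        covers.append((center, phi, label))
        remaining &= ~covered                        # delete the covered points
    return covers

def classify(covers, x):
    """The first cover (in construction order) that contains x decides the class."""
    for center, phi, label in covers:
        if np.dot(center, x) > phi:
            return label
    return None

# Hypothetical two-class data on the unit circle.
angles_deg = [0, 10, 20, 180, 90, 100, 270]
labels = ["A", "A", "A", "A", "B", "B", "B"]
points = np.array([[np.cos(np.radians(a)), np.sin(np.radians(a))] for a in angles_deg])
covers = covering_deletion(points, labels)
predicted = [classify(covers, x) for x in points]
```

A later cover may contain points that were deleted earlier, which is why the covers are evaluated in construction order here. Every training point is then assigned its own class, provided the points are distinct.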

3.2. A Design Approach Based on Covering-Deletion Algorithm 

A design approach, consisting of the covering-deletion algorithm and FP algorithm, 
can be presented below. 

Project all points of $K_1$ and $K_2$ onto the sphere $S^n$, where $S^n$ is in an (n+1)-dimensional
space and its radius $R > \max |x_i|$. These points are still denoted by $K_1$ and $K_2$.

Step 1: Find a neighborhood $C(i)$ (initially, $i = 1$) which only covers points
belonging to $K_1$. The points covered by $C(i)$ are denoted by a subset $K_{1i}$.
Let $K_2 \leftarrow K_1 \setminus K_{1i}$ and $K_1 \leftarrow K_2$, where $A \setminus B$ means that $x \in A$ and
$x \notin B$, and $A \leftarrow B$ means that $A$ is replaced by $B$. If $K_1$ or $K_2$ is empty, stop.
Otherwise, $i = i + 1$, go to Step 1.

Finally, a set $C = \{C_1, C_2, \ldots, C_p\}$ of coverings is found.

Step 2: For the first layer, there are $p$ elements, $A_1, \ldots, A_p$. Each
element $A_i$ corresponds to a covering $C_i$.

For the second layer, there is only one element $B$ with $p$ inputs. Assume
that its weight and threshold are $(u, a)$. Each $u_i$ and $a$ can be obtained
as follows. Assume that finally $K_1$ is empty and $K_2$ is non-empty, $K_1$
corresponds to output 1, and $K_2$ corresponds to output -1. Then

$-a < 0$ (for the final remainder points of $K_2$)

$u_p - a > 0$ (for the points of $K_1$ covered by $C_p$)

$u_{p-1} + b_p u_p - a < 0$ (for the points of $K_2$ covered by $C_{p-1}$)

$\cdots$

$u_1 + b_2 u_2 + b_3 u_3 + \cdots + b_p u_p - a > 0$,

where each $b_i$ is chosen from $\{0, 1\}$. It is easy to see that, no matter what the $b_i$ are,
a solution of the set of inequalities above always exists.
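
One way to see that such a system is always solvable is back substitution from the last cover. The sketch below assumes that odd-numbered covers belong to the class with output 1 and even-numbered covers to the class with output -1, and that hidden outputs are 0 or 1; the function names and the choice $a = 1$ are illustrative assumptions, not the authors' procedure.

```python
from itertools import product

def output_weights(p, a=1.0):
    """Weights u_1..u_p and threshold a for p alternating covers.

    Cover 1 (index 0) belongs to the class with output +1, cover 2 to output -1, etc.
    """
    u = [0.0] * p
    tail = 0.0                                   # sum of |u_j| over later covers j > i
    for i in range(p - 1, -1, -1):
        sign = 1.0 if i % 2 == 0 else -1.0
        magnitude = tail + a + 1.0               # dominates every later contribution
        u[i] = sign * magnitude
        tail += magnitude
    return u, a

def satisfies_all(u, a):
    """Check every inequality for every choice of b in {0, 1}."""
    p = len(u)
    for i in range(p):
        want_positive = (i % 2 == 0)
        for b in product((0, 1), repeat=p - i - 1):
            s = u[i] + sum(bj * uj for bj, uj in zip(b, u[i + 1:])) - a
            if (s > 0) != want_positive:
                return False
    return -a < 0                                # remainder points get output -1

u, a = output_weights(5)
ok = satisfies_all(u, a)
```

For $p = 5$ this gives $u = (32, -16, 8, -4, 2)$ with $a = 1$. The magnitudes double with each additional cover.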

Note: The design algorithm can be extended to $s > 2$ (more than two classes). From
the set of inequalities above, it can be seen that when the number of hidden nodes
increases, the weights $u_i$ will increase rapidly. Therefore, if the number of hidden
nodes is very large, one more layer can be added in order to reduce the weights. A
more detailed discussion will be given in another paper.

3.3. Simulation Results

By using the design procedure above, a set of classification problems was tested. Some computer
simulation results are shown in Table 1. Compared with the results presented in [9], [13], [14] and [15], we can
see that classification of the two-spirals problem based on BP fails [15], and that the generating-shrinking
algorithm presented in [14] takes 3,000 iterations and achieves only an 89.6% correct classification rate.
Fahlman [9] uses a cascade network with more than ten layers to treat the same two-spirals problem,
and a part of his results was presented in [11]. The results shown in Table 1 are much better than those
results.



Table 1. Computer simulation results

Problem                 Training   Learning      Number of   Testing   Correct rate
                        samples    time          coverings   samples   (testing samples)
Two-spirals                 156    (illegible)       10       10,000   98.245%
Two-spirals              20,000    (illegible)       10       10,000   99.995%
Three-spirals               234    0.22 s            24       10,000   91.11%
Three-spirals            30,000    90.46 s           26       10,000   99.067%
Three spatial spirals       183    (illegible)       28       10,000   96.293%
Three spatial spirals    30,000    (illegible)       37       10,000   99.653%



Note: All correct classification rates for training samples are 100%.
The program was implemented on a 486/66 PC under DOS (Windows 95) by Mr. Ling Li.

Abstract. The Combinatorial Neural Model (CNM) ([8] and [9]) is a
hybrid architecture for intelligent systems that integrates symbolic and
connectionist computational paradigms. This model has been shown to be a
good alternative for data mining; in this sense some works have
been presented in order to deal with the scalability of the core algorithm to
large databases ([2,1] and [10]). Another important issue is the pruning
of the network after the training phase. In the original proposal this
pruning is done on the basis of accumulator values. However, this
criterion does not give a precise notion of the classification accuracy that
results after the pruning. In this paper we present an implementation
of the CNM with a feature based on the wrapper method ([6] and [12])
to prune the network by using the accuracy level, instead of the value of
accumulators as in the original approach.



1 Introduction 

Classification systems based on symbolic-connectionist hybrid architectures have 
been proposed ([5,7,4] and [11]) as a way to obtain benefits from the specific 
characteristics of both models. The associative characteristics of artificial neural 
networks (ANN) and the logical nature of symbolic systems have led to easier 
learning and the explanation of the acquired knowledge. 

This work addresses one such architecture, the Combinatorial Neural
Model ([8] and [9]). In this model, the criterion for pruning the network after
training is based on the values of accumulators. However, this criterion does not directly
express the resulting accuracy level. To adjust the accuracy level it is necessary to set an
accumulator value and observe the corresponding accuracy level. After several trials
the user may choose the accumulator value that leads to the desired accuracy. In this
paper we address this problem by providing an automatic process that avoids this tedious
task and reaches the most accurate model without numerous
manual trials.



2 Description of the CNM 

The CNM is a hybrid architecture for intelligent systems that integrates symbolic 
and connectionist computational paradigms. It has some significant features, 
such as the ability to build a neural network from background knowledge; incre- 
mental learning by examples, ability to solve the plasticity-stability dilemma [3]; 
a way to cope with the diversity of knowledge; knowledge extraction of an ANN; 
and the ability to deal with uncertainty. The CNM is able to recognize regulari- 
ties from high-dimensional symbolic data, performing mappings from this input 
space to a lower dimensional output space. 

The CNM uses supervised learning and a feedforward topology with one
input layer, one hidden layer (here called combinatorial) and one output layer
(Figure 1). Each neuron in the input layer corresponds to a concept, that is, a complete
idea about an object of the domain, expressed in an object-attribute-value form.
These neurons represent the evidences of the application domain. On the combinatorial
layer there are aggregative fuzzy AND neurons, each one connected to one or
more neurons of the input layer by arcs with adjustable weights. The output
layer contains one aggregative fuzzy OR neuron for each possible class (also
called hypothesis), linked to one or more neurons on the combinatorial layer.
The synapses may be excitatory or inhibitory and they are characterized by a
strength value (weight) ranging from zero (not connected) to one (fully connected
synapse). For the sake of simplicity, we will work with the learning of crisp
relations, thus with the strength value of synapses equal to one when the concept is
present and zero when the concept is not present. However, the same approach
can be easily extended to fuzzy relations.

The network is created completely uncommitted, according to the following
steps: (a) one neuron in the input layer for each evidence in the training set;
(b) a neuron in the output layer for each class in the training set; and (c) for
each neuron in the output layer, there is a complete set of hidden neurons in
the combinatorial layer which corresponds to all possible combinations (length
between two and nine) of connections with the input layer. There is no neuron
in the combinatorial layer for single connections. In this case, input neurons are
connected directly to the hypotheses.

The learning mechanism works in only one iteration, and it is described 
below: 



Accuracy Tuning on Combinatorial Neural Model 249 




Fig. 1. The complete version of the combinatorial network for 3 input evidences 
and 2 hypotheses [8] 

PUNISHMENT_AND_REWARD_LEARNING_RULE

- Set for each arc of the network an accumulator with initial value zero;

- For each example case from the training database, do:

  • Propagate the evidence beliefs from the input nodes to the hypotheses
    layer;

  • For each arc reaching a hypothesis node, do:

    * If the reached hypothesis node corresponds to the correct class of
      the case

    * Then backpropagate from this node to the input nodes, increasing the
      accumulator of each traversed arc by its evidential flow (Reward)

    * Else backpropagate from the hypothesis node to the input nodes,
      decreasing the accumulator of each traversed arc by its evidential flow
      (Punishment).
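
For crisp relations, the rule above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it keeps one accumulator per arc reaching a hypothesis node, treats the evidential flow as 1 whenever all evidences of a combination are present, and uses the hypothetical names `train_cnm` and `max_len` (the text allows combinations of length up to nine).

```python
from itertools import combinations
from collections import defaultdict

def train_cnm(cases, max_len=2):
    """cases: list of (set_of_evidences, class_label).

    Returns {(combination, hypothesis): accumulator} for arcs reaching hypotheses.
    """
    hypotheses = sorted({label for _, label in cases})
    acc = defaultdict(int)
    for evidences, label in cases:
        present = sorted(evidences)
        for k in range(1, max_len + 1):
            for combo in combinations(present, k):
                for h in hypotheses:
                    # Reward arcs into the correct class, punish arcs into the others.
                    acc[(combo, h)] += 1 if h == label else -1
    return dict(acc)

# Four hypothetical training cases with evidences a, b, c and classes X, Y.
cases = [
    ({"a", "b"}, "X"),
    ({"a", "b"}, "X"),
    ({"a"}, "Y"),
    ({"b", "c"}, "Y"),
]
accumulators = train_cnm(cases)
```

With these four cases the arc from the combination (a, b) to X ends with accumulator 2 and the same combination to Y with -2, so every value stays within [-T, T] for T = 4.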

After training, the value of the accumulator associated with each arc arriving at
the output layer will lie in [-T, T], where T is the number of cases present
in the training set. The last step is the pruning of the network; it is performed
by the following actions: (a) remove all arcs whose accumulator is lower than a
threshold (specified by a specialist); (b) remove all neurons from the input and
combinatorial layers that became disconnected from all hypotheses in the output
layer; and (c) make the weights of the arcs arriving at the output layer equal to the
value obtained by dividing the arc accumulators by the largest arc accumulator
value in the network. After this pruning, the network becomes operational for
classification tasks.



3 Accuracy Tuning

As described above, the pruning of the generated network is done by means
of the accumulator values, which does not give a precise idea of the resulting
classification accuracy of the trained network. To achieve a specific classification
accuracy one has to repeatedly modify the pruning threshold, by trial and error.
In this section we present a model that uses an implementation of a wrapper
algorithm for accuracy tuning of the CNM without these trials. Examples of successful
use of wrapper methods are the attribute subset selection algorithm presented
by John [6] and the algorithm for parameter selection in the ISAAC system [12].
In our application the algorithm wraps the trained network and uses a dataset held out
from training as input to test different accumulator values for pruning.
This model may be easily extended to incorporate more elaborate validation
methods, such as n-fold cross validation. Figure 2 shows how the model carries out the
accuracy tuning. Considering that the values of the accumulators vary from -T
to T, where T is the number of cases present in the training set, the user has to
specify the step to be used in the first function of the wrapper: the generation
of pruning points. For this purpose, the system shows the boundary values of
the accumulators.



Fig. 2. The wrapper method for accuracy tuning on CNM (diagram labels: step for pruning, test set, user choice for pruning point, final model)

Using the specified step, the algorithm divides the interval between the
minimum and maximum accumulator values into equal parts (with a possible exception
for the last part); after this, the user submits the test set to the trained
network. The test is repeated for each pruning point and the results are reported in graphical
form, as depicted in Figure 3. One additional feature is the possibility of focusing
on a particular interval (like a zoom effect) by specifying a shorter
step to refine the accuracy tuning.
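
The sweep over pruning points can be sketched as follows. The rule representation (a dictionary from (evidence combination, hypothesis) to accumulator), the decision rule in `classify` (strongest surviving rule wins) and all names are assumptions made for illustration; only the overall loop, generating pruning points with a fixed step and measuring accuracy on a held-out test set at each point, follows the text.

```python
def classify(rules, evidences, threshold):
    """Return the hypothesis of the strongest surviving rule whose evidences are all present."""
    best, best_acc = None, None
    for (combo, hypothesis), acc in rules.items():
        if acc >= threshold and set(combo) <= evidences:   # arcs below the threshold are pruned
            if best_acc is None or acc > best_acc:
                best, best_acc = hypothesis, acc
    return best

def pruning_points(rules, step):
    """Equally spaced thresholds from the minimum to the maximum accumulator."""
    lo, hi = min(rules.values()), max(rules.values())
    points = []
    t = lo
    while t < hi:
        points.append(t)
        t += step
    points.append(hi)            # the last part may be shorter than `step`
    return points

def sweep(rules, test_set, step):
    """Accuracy on the held-out test set at every pruning point."""
    results = []
    for t in pruning_points(rules, step):
        correct = sum(classify(rules, ev, t) == label for ev, label in test_set)
        results.append((t, correct / len(test_set)))
    return results

# Hypothetical trained accumulators and a small held-out test set.
rules = {
    (("a", "b"), "X"): 2,
    (("a",), "X"): 1,
    (("a",), "Y"): -1,
    (("c",), "Y"): 1,
    (("c",), "X"): -1,
}
test_set = [({"a", "b"}, "X"), ({"c"}, "Y"), ({"a"}, "X")]
results = sweep(rules, test_set, step=1)
```

In this toy run the accuracy stays at 1.0 for thresholds -1, 0 and 1 and drops to one third at threshold 2, where only one rule survives. A user reading such a curve would pick the largest threshold that still gives the desired accuracy.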

The x axis shows the pruning points and the y axis shows the results
at each pruning point. The user may choose a pruning point according to
the values at the bottom of the screen. In the example, pruning
point eight is selected. Having decided where to prune the network, the user orders the
pruning, obtaining the final network with the desired accuracy.

Fig. 3. Pruning points with corresponding accuracy



4 Conclusions and Future Work 



The CNM has received important improvements that make it more suitable for data
mining. These improvements have focused on optimizing processing time [1]
and on the scalability of the algorithm to larger databases [10]. Another important issue
is the development of tools to help with accuracy tuning.

The wrapper method has been shown to be useful for setting the parameters of a model
using its own learning algorithm. Our contribution in the present work is the
use of a wrapper to help the user overcome the laborious task of determining the
pruning point of the combinatorial neural model. With this approach the search
for the most accurate model becomes easier than trying many pruning
alternatives.

Despite the effectiveness of this way of pruning, we believe that this approach
should be improved by allowing the use of better validation methods, such
as n-fold cross validation. By using such validation methods, one can search for
an optimum trade-off between the number of rules remaining after pruning and
the accuracy level of the model.

Abstract. When a system recognizes the world around it, the
system needs to segment or articulate information about the world. In this
paper, we introduce a neural network model called the VSF network for situated
articulation. First, we discuss situation and its relation to information
coding. Next, we introduce the architecture and main procedure
of the VSF network. Finally, we examine the validity of the VSF network
model through an experiment that applies the VSF network to a path
planning task of a rover.



1 Situated articulation 

A system working in an environment recognizes its surroundings
and modifies its knowledge to reduce the difference between the required action
and its own action. Extracting a part of the information from the environment and
constructing knowledge from it is a major part of learning. There are many ways to
extract a part of the incoming information, so a system working
in the real world should discover the most suitable one. Such a system can deal with this
complex task by directing its attention to the most important part of the information
from the point of view of its task and situation [3].

A neural network model is an effective way of discovering knowledge from unorganized
data. From the viewpoint of the neural network model, this type of articulation
can be regarded as a kind of dynamical symbol generation. The key issues of
dynamical symbol generation are summarized in the following three points.

1. Timing of articulation.

When does a system articulate outside information and take it in?

2. Form of information coding.

The result of articulation is coded in a certain form.

3. Relation between timing and the form of information coding.

2 Relation between Timing and Form 

2.1 Combination 

Information articulation is an action in which a system extracts only the worthwhile parts
of information based on its prior experiences. In this study, we direct our
attention to the situation where a conflict occurs between the system's recognition and a
hindrance to it. The conflict can be thought of as a difference between
the system's knowledge and the result of an action based on it. Several systems have utilized
this type of conflict for information articulation [6].

The result of articulation is not a lump of external information but a structured
one. The structures of articulated information are summarized as the following
primitive forms.

- AND type combination.

Some pieces of information are valid when they appear at the same time.

- OR type combination.

Some pieces of information are valid when only a part of them appears.

We define the occurrence of a conflict between a system's recognition
and the required action in a situation as the timing of articulation. This type of
conflict becomes a cue for knowledge reconstruction. In a neural network model,
this difference is defined as the difference between the output of the network and the teacher
signal. In a neural network model, the output for an input is computed through
the internal connections between neurons. Those connections depend on the global or
local minimum of the error function used by the weight update functions. A
change of the coding form of information depending on the difference will affect
the coded information structure. The difference between the output and the teacher
signal for it determines the coding form of the articulated information. The AND
type combination and the OR type combination depend on the timing at which
sub-patterns of information co-occur. If the difference between the output of the
system and the teacher signal for it is large, the system will search for other solutions
to minimize the difference. If it is small, the system will elaborate its internal
connections to improve its ability.

Another prerequisite for developing a dynamical symbol system is a way of representing
the sub-patterns of a pattern obtained by the neural network and a way of combining
them. To realize the first requirement, we employed the oscillator
neural model [1], and to realize the second requirement, a chaotic oscillator lattice model.
In the weight update process, the neurons in the hidden layer are grouped
into groups expressing sub-structures of the articulated information. This dynamics is
controlled by the difference between input and output and the attractors on it. This
scheme can be thought of as a kind of intermittent chaos [5].



3 The VSF network 

An overview of the VSF network is shown in Figure 1. The VSF network is
a kind of hybrid neural network that consists of two modules called the BP
module and the LF module. The BP module is a hierarchical neural network
trained with the back propagation process [4], so the weights on each layer can be thought of
as a result of articulation based on an external evaluation. The LF module is a
two-layer network like the Boltzmann-completion network. The purpose of the
LF module is to detect local features in the inputs. Knowledge reconstruction
in the VSF network is the result of two different kinds of process, namely a
bottom-up process and a top-down process. The bottom-up process is based on local
features of the input signals. The top-down process is a propagation process based
on differences between the inputs and the teacher signals for them. As a result, the
interaction between the local feature layer and the hidden layer becomes the output of
the VSF network.




Fig. 1. The VSF Network 



In the VSF network, the neurons in the feedback layer and the local feature layer
behave chaotically, so the form of the attractor can be expected to change
in response to changes in the inputs to a neuron. The dynamics of the feedback layer and the
local feature layer can be thought of as a kind of CML (Coupled Map Lattice) [2].
Recent research on CML shows that interaction between chaotic elements generates
various forms of element groups.



3.1 Processing on VSF network 

The main purpose of the VSF network is the reconstruction of pre-learned knowledge
based on experience. In the VSF network, pre-learned knowledge is
reconstructed according to the difference between the system's action and its results.

The following is the main procedure of the VSF network. In our
model, the internal state of a chaotic neuron is defined by the logistic map, and
the output of every neuron is defined by the sigmoid function.

1. The weights obtained in the pre-learning phase are used as the initial weights of
each layer of the BP module.

2. The weights obtained in the pre-learning phase are used as the initial weights of
the layers of the LF module.

3. Apply the following procedure to each input pattern.

(a) Input a pattern to the input layer of the BP module.

(b) Input the same pattern to the input layer of the LF module.

(c) The forward process is carried out in the BP module, and the error between
the output of the BP module and the teacher signal is computed.

(d) A procedure based on the Kohonen mapping algorithm is applied
to the LF module. First, the winner is determined. Then each weight
between the input layer and the output layer is updated.

(e) The internal states of the neurons in the hidden layer of the BP module and of the
neurons in the local feature layer of the LF module are mixed, and the internal
state of the feedback layer is determined.

(f) The effect of the chaotic factors on the feedback layer depends on the difference
between the output and the teacher signal. The states of the neurons in the
feedback layer are updated.

(g) Steps (3a) to (3d) are carried out until each member of the neuron
groups becomes stable.

(h) Based on the correlation between neurons, the transmission factor values are
updated.

(i) The mean state value of each hidden-layer neuron over the period is computed.

(j) Using the mean state values of the hidden layer, the forward process of the
BP module is carried out again, and the weights of the BP module are updated.
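
The two building blocks named in the procedure, a logistic map for the internal state of a chaotic neuron and a sigmoid for the output, can be sketched as follows. The map parameter `a = 4.0`, the function names, and the plain averaging used for the mean state are assumptions for illustration; the paper does not give these details in this section.

```python
import math

def logistic_map(x, a=4.0):
    """One update of the internal state of a chaotic neuron (x in [0, 1])."""
    return a * x * (1.0 - x)

def sigmoid(s):
    """Neuron output as a function of its internal state."""
    return 1.0 / (1.0 + math.exp(-s))

def mean_state(x0, steps, a=4.0):
    """Average internal state over a number of logistic-map updates."""
    total, x = 0.0, x0
    for _ in range(steps):
        x = logistic_map(x, a)
        total += x
    return total / steps

# Starting at 0.25 with a = 4 the map reaches the fixed point 0.75 after one step.
m = mean_state(0.25, steps=5)
out = sigmoid(m)
```

Averaging over a period, as in step (i), turns a fluctuating internal state into a single value that can be passed to the forward process of the BP module.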



4 Experiment 

Through an experiment that applies the VSF network to a path planning task
of a rover, we examine the validity of the VSF network model. The environments
of the experiment are shown in Figure 2. The input data for the experiment
are the data from the distance sensors on the rover. The teacher signals for the network
are the possibility of cornering of the rover at a T-junction or an obstacle. The basic
specification of the rover is as follows:

- The rover has n distance sensors directed toward its heading direction.

- The length and width of the rover are variable.

- All weights prior to the experiments were computed by a hierarchical neural
network with back propagation.

4.1 Task and results 

This task was designed to confirm the VSF network's learning performance
when an additional situation and a subtractional situation are provided for the
rover. The data for training and for verifying the network's performance were generated
under the following two kinds of condition.

- Additional condition: task (1) in Figure 2. This task is designed for
verifying the performance of the VSF network in AND type information
processing.

- Subtractional condition: task (2) in Figure 2. This task is designed for
verifying the performance of the VSF network in OR type information
processing.

Fig. 2. Task of the experiment



The tasks for pre-learning are shown in Figure 2 (a-1) and Figure 2 (b-1).
After the pre-learning, the hidden-input weights and hidden-output weights
obtained in the pre-learning phase are provided to both the VSF network and the
back-propagation network.

Figure 3 shows the error rates in the recognition phase of the additional
task and the subtractional task.





Fig. 3. Error rates on the additional and subtractional task 



From these results, we can see the following properties of the VSF network.

- In both conditions, the additional task and the subtractional task, our network
shows better performance than a traditional hierarchical network trained with
back-propagation.

- In the subtractional task, the VSF network shows worse results in the initial 100 learning
steps, but it improves gradually.

- The variance of the error rates in the subtractional task is higher than in all other task
sequences. Conditions in the recognition phase can be divided into better
conditions and worse conditions.



5 Conclusion 

In this paper, we show only the results of the subtractional combination task. In the
additional combination task experiment, the VSF network shows the same tendencies
as the above results. We are now analyzing the firing patterns in the hidden layer to verify
the hierarchical structure of the obtained patterns, and we are trying to apply the VSF network
to a real-world rover.

Abstract. Databases storing real data contain complex patterns, such as chaotic
patterns, which are generally characterized by strongly random fluctuation and
often appear between the deterministic and stochastic patterns of knowledge
discovery in databases. Chaotic patterns have so far always been treated as random fluctuation
distributions and ignored in the literature. A novel network approach to
discovering and predicting chaotic patterns in databases is proposed in this paper, which
together with Zytkow's Forty-Niner can not only discover chaotic patterns but
also predict them efficiently. In addition, this approach is well suited to dealing with
large databases and has broad application prospects in the active research
fields of KDD.



1 Introduction 

Automated knowledge discovery in large databases (KDD) has received extensive
attention in recent years. Many large databases consist of data recording various
natural and social phenomena. Relational data within one attribute, or between
different attributes, of these databases usually appear disordered and fluctuate
greatly at random. Researchers have shown that many natural and social records
(such as the rise and fall of stock prices and changes in the weather)
are all disordered and mixed^^'. However they can be almost described by chaos yet. 
These seemingly disordered and mixed relations are easily confused with random
relations, since they share many characteristics. In fact, among such apparently random
relations there may exist chaotic patterns that result from simple nonlinear systems.

Concepts, digital logic patterns, elementary patterns, statistical patterns and
regularities are all types of relations that represent knowledge in KDD. All of these
relations are deterministic patterns, in whatever manner they are expressed. All other
relations are treated as completely irrelevant and are called random fluctuation
relations. Yet among the random fluctuations there clearly exists a baffling range of
relations that differ from purely noisy ones. Are there any useful patterns in these
relations? Chaos theory provides evidence that between deterministic patterns and
random relations there exists an intermediate kind of relation, which we refer to as
chaotic patterns in this paper. A chaotic pattern is a time sequence or an attribute
relation that can be reconstructed by a low-order nonlinear dynamical system. Since it
differs from both deterministic patterns and purely random relations, the chaotic
pattern, which can be understood as a generalization of the existing deterministic
patterns in KDD, is an important class of patterns in nature.

The establishment and development of chaos and fractal theories have provided a
solid basis and sufficient tools for finding and understanding disordered laws (i.e.,
chaotic relations). If a disordered relation is judged to be chaotic, it can also be
predicted over a short horizon with high accuracy by nonlinear techniques. By
discovering a chaotic pattern, the evolution laws hidden in it can be uncovered and
reconstructed. Moreover, the discovery of chaotic patterns can find laws according to
the character of chaos and recast the existing data into an expressible frame. It thus
provides a completely new approach and line of thought for research on intra- and
inter-attribute relations in KDD.

Since the available methods can hardly deal with chaotic patterns, such patterns are
all treated as random relations. It is well known that statistical techniques are an
important class of methods for estimating attribute relations. Statistics in KDD has
developed many methods for pattern searching and evaluation that are efficient in
analyzing various relations. Nevertheless, they cannot efficiently depict chaotic
attribute relations; rather, they usually dispose of chaotic patterns as random relations.
49er (Forty-Niner), proposed by J. Zytkow [3], is one of the best-known statistics-based
methods for finding useful patterns in databases. Analyses and experiments show that
49er is also inefficient at finding and depicting chaotic patterns and treats them as
random relations. In this paper, a chaotic pattern discovery net (CPDN) is proposed for
our purpose. Based on the relevant work^”*', CPDN has many distinguishing advantages 
as given below. Combined with 49er, it can not only discover chaotic patterns but also
predict the future trend of the patterns.

2 Chaotic Pattern and Neural Network Approach 

Chaotic patterns can be found both within an attribute and between different
attributes in databases. A chaotic pattern is a special kind of dynamic relation whose
behavior lies between deterministic and random relations. For better comprehension of
chaotic patterns, we discuss them here in three aspects: correlativity, prediction length
and Lyapunov exponent. It is well established that deterministic patterns are
characterized by strong correlativity, very long prediction length and negative
Lyapunov exponents. Purely random relations, in contrast, have infinite Lyapunov
exponents and no correlation, so they cannot be predicted. Chaotic patterns lie between
the two: they are characterized by weak correlation, short prediction length and finite
positive Lyapunov exponents.
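
The sign of the Lyapunov exponent can be illustrated numerically. The following is a minimal sketch, not part of the original method: it uses the logistic map as a stand-in nonlinear system (an assumption made purely for illustration) and estimates the exponent as the orbit average of log|f'(x)|. The function name and all parameter values are hypothetical.

```python
import math

def logistic_lyapunov(r, x0=0.3, n_transient=1000, n_iter=20000):
    """Estimate the Lyapunov exponent of x -> r*x*(1-x) as the
    orbit average of log|f'(x)|, where f'(x) = r*(1 - 2x)."""
    x = x0
    for _ in range(n_transient):          # discard the transient part of the orbit
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n_iter):
        deriv = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(deriv, 1e-300))   # guard against log(0)
        x = r * x * (1.0 - x)
    return total / n_iter

# r = 2.5 settles on a fixed point (deterministic, negative exponent);
# r = 3.9 lies in the chaotic regime (finite positive exponent).
stable = logistic_lyapunov(2.5)
chaotic = logistic_lyapunov(3.9)
```

A negative estimate corresponds to the deterministic case described above and a finite positive one to the chaotic case; a purely random sequence has no such finite estimate.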

Approaches based on statistical correlation analysis have been shown to be efficient
in discovering non-chaotic patterns in databases in the KDD field. It is pointed out in [3]
that 49er has a good recognition capability for 2-D patterns: one hundred random
fluctuation patterns were tested by 49er and only one was mistaken for a deterministic
pattern. However, among the other 99 patterns there might exist some chaotic patterns
that 49er cannot pick up. In this paper, we intend to first distinguish deterministic
patterns from random relations by 49er, and then to search for chaotic patterns, if any
exist, among the random relations by a connectionist model.

Chaotic patterns always appear disordered and, in statistical analysis, are similar to
random patterns to some extent. As a result, chaotic patterns were usually classified and
treated as random fluctuations in the past. Other approaches are therefore needed to
discover and forecast chaotic patterns efficiently. Nonlinearity is the fundamental cause
of chaotic patterns. Though seemingly disordered, chaotic patterns obey certain laws
and exhibit some correlation in the time-delayed state space, which gives some memory
ability within and between attributes. However, this correlation is difficult to express by
relational logic, symbolic processing or analytical methods. Neural networks possess
exactly this kind of information processing ability.

This paper proposes a chaotic pattern discovery net (CPDN), which is an RBF-based
neural network approach. Chaotic patterns can occur both within and between
attributes, and CPDN can find and learn the chaotic relationship between attributes.

The conventional RBF network approach adopts a predetermined network structure
(e.g., the number of basis functions) according to the given problem. This approach
cannot deal with the problem of discovering useful patterns in databases, because KDD
poses two basic questions: Is there a pattern? What is the pattern in the database?
Furthermore, this approach is not suitable for large-scale databases.

The proposed CPDN is a two-layer feedforward neural network. The activation
function of the hidden layer is a radial basis function, such as a Gaussian function.
The network output is the linearly weighted sum of the outputs of all hidden nodes.

The main features of the proposed CPDN algorithm are as follows.

(1) The hidden nodes of CPDN are dynamically allocated during the learning process.
Three rules are used to determine whether the network needs a new hidden node: the
learning error, the nearest distance between the input and the existing patterns, and the
change of the degree of belief. A new hidden node is added only when all three rules are
satisfied. The first two rules cover the case in which the incoming data carry novel
information that the network is unable to represent. The third rule indicates that, if the
degree of belief changes, the newly arrived data must contain some interesting pattern
of concern. These three rules, together with the pruning mechanism below, are very
useful for determining whether a pattern exists.

(2) CPDN has a pruning mechanism for basis functions. Because the RBF neural
network has local properties, added hidden nodes respond only to a local region and
ignore the balance of the whole space, which sometimes results in redundant hidden
nodes. Here we apply a punishment factor (with a value strictly between 0 and 1) to the
weight between a hidden node and the output node when the weight is small enough,
since a very small weight means that the corresponding synaptic connection contributes
little to the net’s output. Pruning is then conducted when the weight becomes smaller
than a preset threshold.

(3) CPDN also has good prediction ability for many complex patterns, owing to its
good generalization performance.
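
The allocation and pruning ideas in features (1) and (2) can be sketched as follows. This is a simplified, hypothetical stand-in rather than the authors' CPDN: it handles one-dimensional input and output, uses a fixed Gaussian width, implements only the first two allocation rules (learning error and nearest distance), and omits the degree-of-belief rule because the text does not define it. All names and threshold values are assumptions.

```python
import math

class GrowingRBFNet:
    """Hypothetical, simplified growing/pruning RBF net (1-D input, 1-D output)."""

    def __init__(self, err_thresh=0.1, dist_thresh=0.3, width=0.5,
                 lr=0.1, penalty=0.9, small_w=0.05, prune_thresh=0.01):
        self.centers, self.weights = [], []
        self.err_thresh, self.dist_thresh = err_thresh, dist_thresh
        self.width, self.lr = width, lr
        self.penalty, self.small_w, self.prune_thresh = penalty, small_w, prune_thresh

    def _phi(self, x, c):
        # Gaussian radial basis function centred on c.
        return math.exp(-((x - c) ** 2) / (2.0 * self.width ** 2))

    def predict(self, x):
        # Output = linearly weighted sum of the hidden-node outputs.
        return sum(w * self._phi(x, c) for w, c in zip(self.weights, self.centers))

    def observe(self, x, y):
        err = y - self.predict(x)
        nearest = min((abs(x - c) for c in self.centers), default=float("inf"))
        if abs(err) > self.err_thresh and nearest > self.dist_thresh:
            # Novel sample: allocate a new hidden node centred on it.
            self.centers.append(x)
            self.weights.append(err)
        else:
            # Otherwise adapt the existing output weights by a gradient step.
            self.weights = [w + self.lr * err * self._phi(x, c)
                            for w, c in zip(self.weights, self.centers)]
        self._prune()

    def _prune(self):
        kept = []
        for w, c in zip(self.weights, self.centers):
            if abs(w) < self.small_w:
                w *= self.penalty            # punish weights that are already small
            if abs(w) >= self.prune_thresh:
                kept.append((w, c))          # nodes below the threshold are dropped
        self.weights = [w for w, _ in kept]
        self.centers = [c for _, c in kept]
```

Presenting samples one at a time through `observe` and tracking `len(net.centers)` gives the kind of hidden-node curve discussed in the simulation section: the count keeps growing for uncorrelated data and levels off when a finite set of nodes suffices.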

3 Simulation and Analysis

To demonstrate the capability of detecting chaotic patterns among complex patterns in
databases, we generate the following four kinds of relations for a simulation test.

(1) A1 and A2 are in a random relationship (A1 is generated randomly from a uniform
distribution; A2 is generated in the same way as A1 but independently).

( 2 ) Ai, A 3 is in simple functional relationship, = a + bAi^ , where a and b are 
real and constant number respectively (generated in similar manner as [3]) 

(3) Ai, A 4 is in complex functional relationship that is in the form of 
A4 = a + bA^^* exp(A,^ )• 

(4) A chaotic relation between attributes, generated from the Mackey-Glass
sequence.
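
A Mackey-Glass sequence such as the one used for relation (4) can be generated as in the sketch below. The paper does not state its parameters or how the two attributes are derived from the sequence, so the values here (delay 17, coefficients 0.2 and 0.1, exponent 10, unit Euler step) are commonly used settings taken as assumptions, and pairing consecutive values into two attributes is likewise a hypothetical choice.

```python
def mackey_glass(n_samples, tau=17, beta=0.2, gamma=0.1, power=10, x0=1.2):
    """Euler discretisation (step 1) of
    dx/dt = beta * x(t - tau) / (1 + x(t - tau)**power) - gamma * x(t)."""
    history = [x0] * (tau + 1)            # constant initial history
    for _ in range(n_samples):
        x_t, x_delayed = history[-1], history[-1 - tau]
        history.append(x_t + beta * x_delayed / (1.0 + x_delayed ** power)
                       - gamma * x_t)
    return history[tau + 1:]

def delay_pairs(series, lag=1):
    """Turn a sequence into (attribute-1, attribute-2) records."""
    return list(zip(series[:-lag], series[lag:]))

series = mackey_glass(500)
records = delay_pairs(series)
```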

First, we calculate the significance Q and Cramér’s V coefficient of the above four
patterns, which are illustrated in Fig. 1 and Fig. 2, respectively.
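
For reference, Cramér's V can be computed from a contingency table of the two attributes as in this minimal sketch. The binning of attribute values into table cells is left to the caller and is an assumption; the significance Q (the probability that the observed chi-square arises by chance) is not computed here.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts:
    V = sqrt(chi2 / (N * (min(rows, cols) - 1)))."""
    rows, cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(rows, cols) - 1)))

v_dependent = cramers_v([[10, 0], [0, 10]])    # perfectly associated attributes
v_independent = cramers_v([[5, 5], [5, 5]])    # independent attributes
```

V close to 1 indicates a deterministic relation and V close to 0 a random one, which is the reading used in the discussion of Fig. 2.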





Fig. 1. Curves of Q versus record number

Fig. 2. Curves of V versus record number

With regard to random relationship, Q value is always much larger than a threshold 
of 10'^, and V value always near to zero. For deterministic patterns, Q is below the 
threshold and V is close to 1 all the time. However, the Q and V of the chaotic pattern,
owing to its complexity, change with the record number: they behave like those of
deterministic patterns for small numbers of records and gradually behave like those of
a random relationship as the number of records increases. It can also be seen from Fig. 1
and Fig. 2 that the chaotic pattern indeed lies between the deterministic patterns and
random fluctuation. Nevertheless, 49er cannot discover the chaotic pattern and
classifies it as a random relationship.

Next, we apply CPDN to three kinds of patterns: the deterministic pattern of A1 and
A3, the random relationship of A1 and A2, and the chaotic pattern. In the learning
process, each pair of attribute values in the database is used as the input and the output
of the net, respectively, and the samples are presented sequentially at the input. The
change in the number of hidden nodes during the learning process is plotted in Fig. 3.

Fig. 3 shows three curves reflecting the change in the number of hidden nodes when
CPDN learns the three kinds of patterns. The dashed curve denotes the change of hidden
nodes in learning the deterministic pattern of A1 and A3, and the solid curve the random
relationship between A1 and A2. The change of hidden nodes for the chaotic pattern
between the two attributes attri-1 and attri-2 is drawn as a thick solid line. For the
random relationship, the number of hidden nodes keeps increasing as the input records
increase, and the curve changes sharply and irregularly. Because data in a random
relationship lack correlation, each newcomer is novel to the system and CPDN must
allocate additional hidden nodes to accommodate it. For the deterministic relationship,
the number of hidden nodes changes steadily and gradually reaches a stable state. As
there is strong correlation in the deterministic pattern, the character of the pattern is
limited. Therefore, CPDN can express the pattern efficiently with a finite number of
hidden nodes and separate such patterns from random fluctuation. For the chaotic
pattern, the number of hidden nodes stops changing at 1300 observation samples during
the learning process, which means that 24 hidden nodes are enough to express the
chaotic pattern in this simulation.




Fig. 3. Curves of hidden nodes versus number of observations

4 Concluding Remarks 

The chaotic pattern is a class of useful patterns that exists in databases but has been
ignored so far; it is an embodiment of natural complexity. This paper proposed a
chaotic pattern discovery net, based on neural network methodology, that works
together with a statistical method to detect and discover chaotic patterns among
complex patterns in databases.

Abstract. Many data mining techniques have been developed and
shown to be successful in financial domains. A further aim is to make
sense of numerical data in a human-friendly way, by which
general patterns are extracted in terms of linguistic concepts. Problems
associated with the linguistic mining approach are the effective
representation and the validity preservation of the linguistic patterns.
Volatile data may change linguistic concepts and make previously
discovered patterns invalid. This paper aims to solve this problem.
Based on the cloud model proposed in our previous work, linguistic
patterns can be represented effectively. Outdated linguistic patterns can
be made valid again by a GA-based validity preservation technique in
line with the current data set. An example from the Hong Kong stock
market is given to illustrate how the technique works.



1 Introduction 



Linguistic Pattern (PI). If the 3-day accumulated change of the index is positive and
the volume remains heavy, then the short-term market tendency is up. Meanwhile, a
potential drop is implied when the accumulated gain becomes big positive and the
volume falls from heavy to weak.

In this prediction, linguistic concepts such as “positive”, “heavy” and “up” are
used instead of exact numbers to describe the changes of the stock index, the market
volume and the tendency. The advantages of linguistic patterns are that they can
carry more information than numerical patterns and are easily understood by users.

There are two problems associated with the linguistic mining approach. The basic
one is to represent linguistic concepts effectively. This problem is solved by the
so-called cloud model proposed in our previous work [4,5].

The other problem is due to the volatility of financial data. Rapidly changing data
can make linguistic concepts outdated in different time periods and thereby invalidate
previously discovered linguistic patterns. The mapping data → linguistic concept
becomes invalid even if the mapping linguistic concept → pattern is still correct. This is
the problem the paper aims to solve.



2 Linguistic Representation Based on the Cloud Model 



Three kinds of linguistic items need to be represented: individual linguistic terms,
linguistic variables and linguistic patterns. For instance, the example PI is a linguistic
pattern with the linguistic variables a-change, volume and tendency, and each variable
is associated with several linguistic terms, such as positive and big positive for
a-change, and heavy and weak for volume.

Represent Linguistic Term. The cloud model has been defined in [4]. It quantifies
a linguistic term by three numerical parameters: expected value Ex, entropy En and
deviation D, e.g. heavy(24, 3, 0.01) with Ex = 24, En = 3, D = 0.01. Ex gives the
position in U corresponding to the center of gravity of the cloud. En measures how many
elements in U are accepted by the linguistic term. D measures the randomness of the
membership degrees in the cloud. The mathematically expected curve (MEC) of a
cloud is defined as:



$$MEC(u) = e^{-\frac{(u - Ex)^2}{2En^2}} \qquad (1)$$

Given Ex, En and D, a so-called cloud generator can generate drops $(u_i, \mu_i)$, where
$u_i$ is a value in U and $\mu_i$ is the degree to which $u_i$ belongs to the linguistic term.
A specific drop $(u, \mu)$ is generated if either a value $u$ or a degree $\mu$ is given.
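
A cloud generator of this kind can be sketched as follows. The sketch assumes the commonly used normal cloud construction, in which each drop first perturbs the entropy with standard deviation D and then draws a value around Ex; whether this matches the generator of [4] in every detail is an assumption, as are the function names and the fixed random seed.

```python
import math
import random

def mec(u, ex, en):
    """Mathematically expected curve of a cloud with expected value ex and entropy en."""
    return math.exp(-((u - ex) ** 2) / (2.0 * en ** 2))

def cloud_drops(ex, en, d, n, rng=None):
    """Generate n drops (u, mu) for the term (ex, en, d)."""
    rng = rng or random.Random(0)
    drops = []
    for _ in range(n):
        en_i = max(abs(rng.gauss(en, d)), 1e-12)   # randomised entropy for this drop
        u = rng.gauss(ex, en_i)                    # value in the universe U
        mu = math.exp(-((u - ex) ** 2) / (2.0 * en_i ** 2))  # membership degree
        drops.append((u, mu))
    return drops

# The term heavy(24, 3, 0.01) from the text.
heavy_drops = cloud_drops(24, 3, 0.01, 100)
```

With a small D the drops lie close to the expected curve; a larger D spreads the membership degrees around it.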

Represent Linguistic Variable. A general format,
$v\{T_1(Ex_1, En_1, D_1), \ldots, T_m(Ex_m, En_m, D_m)\}$, is used to represent a
linguistic variable v that contains m linguistic terms. For example, three linguistic
variables used in stock data analysis are defined as:

a-change {big-negative(-450, 50, 0.03), negative(-200, 90, 0.04), neutral(0, 60, 0.02),
positive(200, 90, 0.04), big-positive(450, 50, 0.03)};
volume {weak(15, 2, 0.02), moderate(19, 4, 0.04), heavy(25, 4, 0.05)}; and
tendency {down(-200, 100, 0.04), neutral(0, 60, 0.02), up(250, 80, 0.04)}.
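
The variable definitions above map naturally onto a small lookup structure. The sketch below stores two of the variables as dictionaries and picks the term with the highest expected membership for a numeric input. Selecting the term by the expected curve alone is an assumption: the paper's generators also use D and may resolve inputs differently, so this sketch is not meant to reproduce the compatibility degrees reported for Fig. 1.

```python
import math

# Terms as (Ex, En, D), copied from the definitions in the text.
A_CHANGE = {
    "big-negative": (-450, 50, 0.03),
    "negative": (-200, 90, 0.04),
    "neutral": (0, 60, 0.02),
    "positive": (200, 90, 0.04),
    "big-positive": (450, 50, 0.03),
}
VOLUME = {
    "weak": (15, 2, 0.02),
    "moderate": (19, 4, 0.04),
    "heavy": (25, 4, 0.05),
}

def best_term(variable, value):
    """Return (term, degree) for the term whose expected curve is highest at value."""
    scores = {
        term: math.exp(-((value - ex) ** 2) / (2.0 * en ** 2))
        for term, (ex, en, _d) in variable.items()
    }
    term = max(scores, key=scores.get)
    return term, scores[term]

change_term, change_degree = best_term(A_CHANGE, 210)
volume_term, volume_degree = best_term(VOLUME, 24)
```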



Represent Linguistic Pattern. A linguistic pattern is represented based on the 
linguistic variables and their relationships. 

Fig. 1. Linguistic Pattern PI Represented by Cloud Generators



Fig. 1 shows the representation of the above example PI, where the cloud generators
of a-change and volume are used for the two inputs, the 3-day accumulated change and
the market volume, respectively. The output of the pattern is the prediction of the
market tendency, which is represented by the generator of tendency. The operator
“soft-and” is defined as Eq. 2, where z is the result of x “soft-and” y.



yV 


2 




l 2 J 




1 2 J 



( 2 ) 



When a new pair of inputs is given, say <+642, 20> for a-change and volume
respectively, the cloud generators produce the linguistic terms triggered by the inputs
together with their compatibility degrees; in Fig. 1 they are (big-positive, 0.87) and
(heavy, 0.92). The operation “x soft-and y” is then performed, giving (soft-and, 0.79).
The final output for the market tendency is (up, 0.76).



3 Outdated Linguistic Variable and Invalid Linguistic Pattern 



Data from the Hong Kong Stock Exchange (HKSE) are used as the example data
set and are shown in Fig. 2, a combined chart of the 3-day a-changes of the Hang Seng
Index (HSI) and the market volumes during the period from January 1996 to December 1997.




Fig. 2. Combined chart of 3-day HSI accumulated changes and volume of HKSE



The historical data of a time period, e.g. 1997, can be used to test the validity
of linguistic patterns. For PI, a prediction is considered correct if the market
moves in the same direction as the predicted tendency. Fig. 3(a) is the result of
applying PI to the data set of 1997; it indicates a strongly rising market with 92% “up”
and 4% “down”. However, the real behavior shown in Fig. 2 is distinctly different: the
stock market generally rises in the first half of the year but drops sharply in the second
half. Obviously, the linguistic pattern is invalid with respect to the data of 1997.



Fig. 3. Outdated linguistic variable invalidates the linguistic pattern. Panel (a): predicted tendency in 1997.



The reason can be found by looking at the behavior of the linguistic variables.
Fig. 3(b) shows the linguistic variable “volume” with respect to the same data set,
where “heavy” dominates and “weak” is almost zero. This result obviously
conflicts with the real situation shown in Fig. 2. Since the linguistic terms do not
refresh their mathematical definitions accordingly and become outdated for the current
data set, the linguistic pattern becomes invalid and leads to distorted conclusions.



4 Refresh Linguistic Variables 



An outdated linguistic variable can be refreshed by adapting its numerical parameters
to the current data set. The statistical characteristics of the data distribution are used
to optimize the parameters. Since the parameter D does not influence whether an element
belongs to a linguistic term or not, only Ex and En need to be considered.

The function f is synthesized from three sub-functions $f_a$, $f_b$ and $f_c$, which evaluate
the individual terms, the relation between two neighboring terms and the percentage of
data covered by the terms, respectively. The function f is defined by Eq. 3, where m is
the number of terms in the linguistic variable.

$$f = \Big(\sum_{i=1}^{m} f_a(Ex_i, En_i) + f_b(v) + f_c(v)\Big)/3 \qquad (3)$$

Function $f_a(Ex, En)$. The function $f_a(Ex, En)$ is a combination of $f_{a1}$, $f_{a2}$ and
$f_{a3}$, which are defined to evaluate the symmetry of a term, the concentration of its
kernel and the sparseness of its margin.



f (Ex,En) = (Ex, En) + 2/, (Ex, En) + f, (Ex, En) )/5 (4) 

$$f_{a1}(Ex, En) = \sum_{i=1}^{n} Vote(x_i) \Big/ \sum_{j=1}^{n} Vote(x_j), \quad x_i \in L,\ x_j \in R \qquad (5)$$

where $x_i$ is an element of the data set, n is its total number, $Vote(x_i)$ is the
frequency of $x_i$, $L = [C.Ex - 3C.En/2,\ C.Ex]$ and $R = [C.Ex,\ C.Ex + 3C.En/2]$,



f„^{C-Ex,C-En) = l- 



^ Vote(Xi ) / X Vote{x j ) - 0.68 



X,. e B, X . e A 



( 6 ) 



where , B^C-Ex -ffjfEn, C - Ex + -JfJlEn ] , A^[C-Ex - 3C-En/2 , C-Ex + 3C-En/2] 
f^,{C-Ex,C-En) = \- 



^ Vote(x^) + ^ Vote{x.) | j ^ Vote(x^) 

i=i y=i 



-0.4 



(7) 



Xj e PR, X. e PL,x^ e A 

where PR = [C-Ex + C-En, C-Ex + 3C-En/2] , PL= [C-Ex - 3C-En/2, C-Ex - C-En] 



Function $f_b(v)$. The function $f_b$ calculates the redundancy between two neighboring
terms, which is measured by the intersection of their clouds.

m 

fb(v) = Y,(M^bb(Ex„En,),f,^(Ex,^^,En,J))/m ( 8 ) 

/=i 

where $f_{bb}$ calculates the intersection between $T_i(Ex_i, En_i)$ and $T_{i+1}(Ex_{i+1}, En_{i+1})$.



fi^(Ex„En,) = 



0, X 6 basic group. 
2x - 2Ex, - En, 



En. 

1, X e weak periphery, 



, X 6 periphery, 



_ {Ex,En,^,)-{Ex,^,En)) 
En,^, —En, 



(9) 



Function $f_c(v)$. The function $f_c$ measures the percentage of data covered by the
linguistic variable v. It is defined as the sum of the data percentages covered by all
linguistic terms.

$$f_c(v) = \sum_{x_k \in S} Vote(x_k) \Big/ \sum_{j=1}^{n} Vote(x_j), \quad S = \bigcup_{i=1}^{m} S_i \qquad (10)$$

where $S_i = [Ex_i - 3En_i/2,\ Ex_i + 3En_i/2]$.
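
The coverage measure can be illustrated with a short sketch. It assumes that the covered region is the union of the intervals [Ex - 3En/2, Ex + 3En/2] of all terms and that the data are given as a plain list of observations, so the frequency of a value is implied by how often it occurs. The function name and the sample data are hypothetical.

```python
def coverage(data, terms):
    """Fraction of observations that fall inside at least one term interval
    [ex - 1.5*en, ex + 1.5*en]; terms is a list of (ex, en) pairs."""
    def covered(x):
        return any(ex - 1.5 * en <= x <= ex + 1.5 * en for ex, en in terms)
    return sum(1 for x in data if covered(x)) / len(data)

# Two terms, weak(15, 2) and heavy(25, 4), give the intervals [12, 18] and [19, 31].
ratio = coverage([10, 15, 20, 100], [(15, 2), (25, 4)])
```

A refreshed variable whose terms leave much of the current data uncovered would score low on this measure.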



5 Experiment and Discussions 

Applying the algorithm to the data from January to December 1997, the Ex and En of
volume {weak(12, 6, 0.03), moderate(15, 8, 0.04), heavy(18, 10, 0.05)} in 1996 are
refreshed in line with the data sets of July, August and November 1997 respectively
(Fig. 4-Left), i.e. {weak(16, 8), moderate(19, 10), heavy(22, 16)},
{weak(22, 10), moderate(27, 18), heavy(35, 20)} and
{weak(7, 3), moderate(8.2, 6), heavy(10.6, 7)}.







Fig. 4-Left. Linguistic variable volume in July, August and November 1997

Fig. 4-Right. Comparison between the predicted tendency and the real movement of HSI in 1997

The prediction obtained by the refreshed pattern is shown in Fig. 4-Right,
together with the real movement of the HSI. The predicted market slides in the first
quarter, soars during the middle of the year and falls sharply in the last quarter, which
generally matches the real stock market. This shows that the validity of the linguistic
pattern is preserved.

Abstract. This paper gives a formulation of inductive learning based
on fuzzy logic programming (FLP) and a top-down algorithm for it, obtained
by extending the inductive logic programming (ILP) algorithm FOIL. The
algorithm was implemented and evaluated by experiments. Linguistic
hedges, which modify truth values, are shown to be effective in adjusting
classification properties. The algorithm deals with structural domains as other
ILP algorithms do, and also works well with numeric attributes.



1 Introduction 

Inductive logic programming (ILP) is attracting attention from KDD researchers. It
has been shown to work for classification problems with structural domains, such as
problems in natural language processing and the mutagenicity of chemical
compounds [1]. In spite of this success, ILP is less applicable to domains with numeric
data. Several methods for handling numbers have been proposed: [2] classifies real
numbers by tolerance ranges, [3] and [4] inherit a method from propositional decision
tree learning, and [5] is based on constraint logic programming.

The algorithm given here handles numeric data and other structural data
based on fuzzy logic programming (FLP). Although a membership function fixes
the treatment of numbers, combining it with linguistic hedges has the effect of supplying
several membership functions, that is, several ways of classifying examples.

2 Inductive Fuzzy Logic Programming 

Several frameworks of FLP have been proposed (e.g. [6,7]). Here, we define a
framework of inductive fuzzy logic programming (IFLP), mainly following [7].

N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 268-274, 1999.

Fuzzy sets extend the true/false membership to continuous membership in
[0, 1]. When $\mu_A$ and $\mu_B$ denote the membership functions of fuzzy sets A and B,
$A \cup B$, $A \cap B$ and $\bar{A}$ are defined by $\mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x))$,
$\mu_{A \cap B}(x) = \min(\mu_A(x), \mu_B(x))$ and $\mu_{\bar{A}}(x) = 1 - \mu_A(x)$, respectively.
We write an element a in A as $a/\mu_0$ when $\mu_A(a) = \mu_0$. The difference $A - B$ is
$A \cap \bar{B}$. sp(A), the support set of A, is $\{a \mid a/\mu \in A,\ \mu > 0\}$.
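
These set operations are easy to state in code. The sketch below represents a fuzzy set as a dictionary from elements to membership degrees, with missing elements taken to have degree 0; this representation and the function names are illustrative choices, not part of the paper.

```python
def f_union(a, b):
    """Membership of the union is the maximum of the two memberships."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

def f_intersection(a, b):
    """Membership of the intersection is the minimum of the two memberships."""
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

def f_difference(a, b):
    """A - B is the intersection of A with the complement of B."""
    return {x: min(mu, 1.0 - b.get(x, 0.0)) for x, mu in a.items()}

def support(a):
    """Elements with strictly positive membership."""
    return {x for x, mu in a.items() if mu > 0}

A = {"p": 0.8, "q": 0.3}
B = {"p": 0.5, "r": 1.0}
union_ab = f_union(A, B)
inter_ab = f_intersection(A, B)
diff_ab = f_difference(A, B)
```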

A fuzzy logic program P is a fuzzy set of Horn clauses. The semantics of a
program is given by a fuzzy Herbrand interpretation (FHI), a fuzzy subset of
the Herbrand base. For an FHI I, I(A), called the truth of A in I, denotes the
membership of the ground atom A in I. The truth of a ground clause is given by

$$I(A \leftarrow B_1 \wedge \cdots \wedge B_n) = \min\{n + I(A) - I(B_1) - \cdots - I(B_n),\ 1\}.¹ \qquad (1)$$

Then, the truth of a clause c is defined by $I(c) = \min_{\theta \text{ s.t. } c\theta \text{ is ground}} I(c\theta)$.²
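
The ground-clause truth of Eq. (1) can be checked with a few lines of code. This is a small sketch under the reading of Eq. (1) as min{n + I(A) - I(B1) - ... - I(Bn), 1}, with the head truth and the list of body truths passed in directly; the function name is hypothetical.

```python
def clause_truth(head, body):
    """Truth of a ground clause A <- B1 ^ ... ^ Bn under Lukasiewicz
    conjunction and implication: min(n + I(A) - sum of I(Bi), 1)."""
    return min(len(body) + head - sum(body), 1.0)

fully_true_body = clause_truth(0.5, [1.0, 1.0])   # the head truth limits the clause
weak_body = clause_truth(0.9, [0.3])              # a weak body makes the clause true
fact = clause_truth(0.2, [])                      # a body-less clause has the head's truth
```

The clause is fully true whenever the head is at least as true as the Lukasiewicz conjunction of the body, and loses truth in proportion to the shortfall otherwise.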

An FHI I is a fuzzy Herbrand model (FHM) of a program P (denoted by
$I \models P$), if $\forall c \in sp(P),\ \mu_P(c) \le I(c)$. The least³ FHM (LFHM) of P, denoted
by $M_P$, can be characterized by almost the same procedure as in [8] for LP. For a
ground fact $A/\mu$ and a program P, we can see that $I \models A/\mu$ for every FHM I
of P iff $M_P \models A/\mu$, for the LFHM $M_P$ of P. $P_1 \models P_2$ if $I_{P_1} \models P_2$ for every FHM
$I_{P_1}$ of $P_1$, or equivalently $M_{P_1} \models P_2$.

The problem of inductive fuzzy logic programming (IFLP) is to find a fuzzy
logic program H, called a hypothesis, satisfying

$$\text{(a) } \forall e^+ \in E^+,\ B \cup H \models e^+/\mu^+, \quad \text{and (b) } \forall e^- \in E^-,\ B \cup H \not\models e^-/\mu^-, \qquad (2)$$

when a fuzzy logic program B, called background knowledge, a pair $(E^+, E^-)$ of
sets of (crisp) ground facts of a target predicate R, and constants $\mu^+, \mu^- \in [0, 1]$
are provided. Facts in $E^+$ ($E^-$) are called positive (negative) examples of R.

3 A Top-Down IFLP Algorithm 

By lifting the operations of FOIL [3] to a fuzzy version, we propose an algorithm
(Fig. 1) for IFLP problems with the following restrictions: (1) B does not include R,
(2) H includes only clauses whose heads have R, (3) H does not include clauses defined
recursively, and (4) all memberships of clauses in H are 1.

The algorithm greedily finds clauses that do not cover negative examples
with membership more than $\mu^-$, until every positive example is covered by a
clause with membership more than $\mu^+$. To find such a clause, the inner loop keeps
adding a literal to the body of the clause while the condition in Line 6 is satisfied.

For a crisp clause c, covered(c) defines a fuzzy set

$$covered(c) = \{e/\mu \mid e \in E^+ \cup E^-,\ \mu = M_{B \cup \{c/1\}}(e)\}, \qquad (3)$$

which updates remain in Line 11 in order to check condition (2-a) in Line 4.

The condition for the inner loop, in Line 6, is also expressed by covered(clause).
The algorithm keeps adding literals to clause while it holds. By the conditions in
Lines 4 and 6, theory obviously satisfies conditions (2-a) and (2-b).

¹ This rule is based on Łukasiewicz’s conjunction and Łukasiewicz’s implication [7].

² θ is a ground substitution. The application of θ to c is denoted by cθ.

³ In the sense of fuzzy set inclusion, i.e. A ≤ B if A ⊆ B, or $\mu_A(x) \le \mu_B(x)$ for all x.

1  Initialization
2    theory := a null program
3    remain := {e+/1 | e+ is a positive example of R}
4  While μ_remain(e+) > 1 − μ+ for some e+ ∈ sp(remain)
5    clause := R(x1, ..., xn0) ←
6    While μ_covered(clause)(e−) ≥ μ− for some negative example e−
7      Find an atom l by using gain
8      Add l to the body of clause
9    EndWhile
10   Add clause to theory
11   remain := remain − covered(clause)
12 EndWhile

Fig. 1. An outline of an IFLP algorithm 



To extend gain, the heuristic that FOIL uses to evaluate literals, we define fuzzy sets
corresponding to sets of assignments of variables in clause. An assignment is a
sequence $(a_1, a_2, \ldots)$ of values assigned to the variables $x_1, x_2, \ldots$ occurring in
clause, in this order. For the initial body-less clause, the fuzzy sets $A_0^+$ and $A_0^-$ of
assignments for positive and negative examples are

Aq = |(oi , - ■ 071^)/! |i7(oi, ■ ■ ■, Otiq) F/ j- , Aq = |(oi , ■ ■ ■, ^R[ai,--,anQ)^E J- . 

For A'l , assignments after adding an atom"* I = P{xi ^ , • • • , Xi^,) are: 

_f(ai, ,a„+m) £ A+ and 0 is a ground substitution 1 

which is consistent with (oi, -, a„)J ’ ' 

where xi, - ■ ■ ,Xn are variables in clause before adding I and Xn+i, • • • , Xn+m are 
new ones introduced by 1. A~f_^ is similarly defined by replacing all -|- by — . 

For sets $A^+$ and $A^-$, the information for finding that an assignment is in $A^+$ is

$$info(A^+, A^-) = -\log_2\big(T^+/(T^+ + T^-)\big), \qquad (5)$$

where 

e sp(A+) max {o, Mi+((ai,..., o„)) - M 7 ^^((ai, a„„))} , 

^ ~ X](ai,---,o„)esp(A~) Ma- ® n))- 

Intuitively, $T^+$ is the total membership of the assignments for positive examples
covered by clause but not by theory. For $T^-$, all covered assignments are
considered. The first $n_0$ of $a_1, \ldots, a_n$ are the assignments for the variables in the
head. Then gain, a fuzzy version of FOIL's gain, expressing the information gained by l, is

$$gain(l) = k \times \big(info(A^+, A^-) - info(A'^+, A'^-)\big), \qquad (6)$$
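
The two quantities can be sketched directly from Eqs. (5) and (6). In this sketch the totals $T^+$ and $T^-$ and the factor k are assumed to be already computed and are passed in as plain numbers; the function names and the sample values are hypothetical.

```python
import math

def info(t_pos, t_neg):
    """Information needed to signal that an assignment is positive:
    -log2(T+ / (T+ + T-))."""
    return -math.log2(t_pos / (t_pos + t_neg))

def gain(k, before, after):
    """Fuzzy gain of a literal: k times the drop in info, where before and
    after are (T+, T-) pairs for the clause without and with the literal."""
    return k * (info(*before) - info(*after))

# Adding a literal that removes all negative membership while keeping the positives.
example_gain = gain(2.0, (1.0, 1.0), (1.0, 0.0))
```

A literal scores well when it lowers the share of negative membership among the covered assignments while k, the positive membership it retains, stays large.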

⁴ Our algorithm for IFLP allows only positive literals or atoms.
when $A^+$ and $A^-$ become $A'^+$ and $A'^-$ by adding l, where k is calculated by
..,a„)esp(A+) max|o, maxtruth^,+ {{ai,...,an)) - ^ remain ^^^^’' ’ 

(ai,-.’-, d„ "■■) GSp(A'+) 

where maxtruth^,+ {{ai, --, a„)) = max a=(ai,...,a„,.. ) Ma'+( “) (®ii , an, --) 
represents an assignment also for new variables introduced in 1. 



4 Implementation and an Example Problem 

The algorithm is implemented as a system called FCI. To calculate the LFHM, it
combines a top-down method like SLD-resolution for background knowledge with a
bottom-up fixpoint calculation for hypotheses.

FCI allows the use of the linguistic hedges very and more-or-less only in
hypotheses, not in background knowledge. They modify the truth as
$I(\textit{very } l) = I(l)^2$ and $I(\textit{more-or-less } l) = \sqrt{I(l)}$.
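
Read as the usual concentration and dilation hedges (squaring and square root, which is how the garbled formulas above are interpreted here), the two modifiers amount to the following sketch. The function names are illustrative.

```python
import math

def very(truth):
    """Concentration hedge: squaring makes the literal harder to satisfy."""
    return truth ** 2

def more_or_less(truth):
    """Dilation hedge: the square root makes the literal easier to satisfy."""
    return math.sqrt(truth)

strict = very(0.5)            # lower than the unmodified truth 0.5
loose = more_or_less(0.25)    # higher than the unmodified truth 0.25
```

For any truth strictly between 0 and 1, very lowers it and more-or-less raises it, which is how the hedges supply several membership functions from one.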




Fig. 2. Positive and negative examples for good-arch. 



Fig. 2 illustrates an example IFLP problem: learning a predicate good-arch that
specifies well-constructed arches. For this problem, FCI induced the following
hypotheses from an input file (Fig. 3).

good-arch(A) ← has-archblk(A, B) ∧ well-proportioned(B).

good-arch(A) ← has-roofblk(A, B) ∧ has-supportblks(A, C, D) ∧ well-proportioned(B)
∧ very balanced-distance(C, D) ∧ support(B, B) ∧ support(C, B).

In the file, {very} in the balanced-distance line allows the linguistic hedge very to be
combined with this predicate. truth defines the membership for the clauses including
it. It takes the form ‘truth(m)’ for a truth value m, or ‘truth(V, x1, y1, ..., xk, yk, n)’
for a membership function of a variable V, constituted by linking the
points (x1, y1), ..., (xk, yk) and raising the result to the n-th power.
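
The second form of truth can be sketched as a piecewise-linear interpolation followed by a power. The sketch assumes that values outside the listed points take the membership of the nearest end point, which the text does not specify; the function name mirrors the predicate but is otherwise hypothetical. The argument lists are taken from the input file for learning good-arch.

```python
def truth(value, *args):
    """Membership built by linking points (x1, y1), ..., (xk, yk) with straight
    lines and raising the result to the n-th power.
    Call as truth(value, x1, y1, ..., xk, yk, n)."""
    *coords, n = args
    points = list(zip(coords[0::2], coords[1::2]))
    if value <= points[0][0]:            # left of the first point (assumed clamp)
        return points[0][1] ** n
    if value >= points[-1][0]:           # right of the last point (assumed clamp)
        return points[-1][1] ** n
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= value <= x1:
            y = y0 + (y1 - y0) * (value - x0) / (x1 - x0)
            return y ** n

# balanced-distance uses truth(D, 0,0, 10,1, 30,0, 1).
peak = truth(10, 0, 0, 10, 1, 30, 0, 1)
halfway = truth(20, 0, 0, 10, 1, 30, 0, 1)
# medium-thickness uses truth(TH, 0,0, 0.8,0, 1.5,1, 2.5,0, 2).
thick = truth(1.45, 0, 0, 0.8, 0, 1.5, 1, 2.5, 0, 2)
```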

[target] % type and mode of target predicate
good-arch(+).              good-arch(arch).

[prototype] % predicates that can be used in hypotheses
has-roofblk(+,-).          has-roofblk(arch,bl).
has-supportblks .          has-supportblks(arch,bl,bl).
has-archblk(+,-).          has-archblk(arch,bl).
support(+,+).              support(bl,bl).
balanced-distance(+,+).    {very}balanced-distance(bl,bl).
well-proportioned(+).      well-proportioned(bl).

[knowledge] % definition of background knowledge
has-archblk(arch5,a5). has-archblk(arch6,a6). has-archblk(arch11,a11).
has-roofblk(arch1,r1). has-roofblk(arch2,r2). ...
has-supportblks(arch1,bl1a,bl1b). has-supportblks(arch2,bl2a,bl2b). ...
support(bl1a,r1). support(bl1b,r1). ...
distance(bl1a,bl1b,11.0). distance(bl2a,bl2b,9.3). ...
length(bl1a,5.9). length(bl1b,6.0). ...
thickness(bl1a,1.45). thickness(bl1b,1.5). ...
balanced-distance(BL1,BL2) :- distance(BL1,BL2,D), truth(D,0,0,10,1,30,0,1).
well-proportioned(BL) :- medium-length(BL), medium-thickness(BL).
medium-length(BL) :- length(BL,H), truth(H,0,0,3,0,6,1,10,0,1).
medium-thickness(BL) :- thickness(BL,TH), truth(TH,0,0,0.8,0,1.5,1,2.5,0,2).

[positive] % positive examples
good-arch(arch1). good-arch(arch2). ... good-arch(arch6).

[negative] % negative examples
good-arch(arch7). good-arch(arch8). ... good-arch(arch12).



Fig. 3. An input file for learning good-arch. 



5 Experiments 

We conducted three experiments, using the Animal dataset from the examples of
Progol [10] and the Iris and Labor Negotiations datasets from the UCI machine learning
databases [9].

The Animal dataset confirmed that FCI works as an ILP algorithm. Table 1 summarizes the
results for the other two. The results of FCI and FOIL-6.4 are taken over ten complete
five-fold cross-validations. The result of C4.5 was taken from [11]. For the induction
with FCI we gave three different settings of $\mu_+$ and $\mu_-$. For the test of unseen
cases with FCI, we used the threshold 0.5 for classifying the truth of cases. FCI
classified both datasets with better accuracy than the others.



Table 1. Error rates (%) and complexities of hypotheses induced.

                      error rates                       complexities
dataset  FCI setting  FCI        C4.5       FOIL        FCI clauses  FCI atoms  C4.5 nodes  FOIL clauses  FOIL atoms
iris     (0.7, 0.3)   4.73±.15   4.87±.17   6.13±.39    2.20±.08     3.52±.16   8.50±.00    3.6±.09       6.96±.20
         (0.6, 0.4)   5.33±.0                           1.46±.06     2.00±.01
         (0.5, 0.5)   5.33±.0                           1.00±.00     1.00±.00
labor    (0.7, 0.3)   18.4±1.2   19.1±1.0   20.2±1.3    2.54±.05     6.04±.19   7.00±.30    3.02±.06      5.46±.11
         (0.6, 0.4)   16.0±1.0                          2.48±.08     5.74±.19
         (0.5, 0.5)   14.7±1.0                          2.42±.05     5.86±.14

Clauses, atoms and nodes mean the numbers of clauses, atoms and nodes induced.
Table 1 also shows the complexity results. The numbers of atoms for FCI and
FOIL correspond to the number of nodes for C4.5. FCI induced smaller programs than the others.

6 Concluding Remarks 

We proposed IFLP, a framework of inductive learning based on FLP, and an
algorithm to solve IFLP problems. The algorithm was implemented as a system,
FCI, and was evaluated with three datasets. The system provides a framework for
describing background knowledge in FLP. This can be expected to work for
learning problems in complex domains.

The power of linguistic hedges is remarkably strong for induction. They are, however,
also important for representation complexity. When we describe a constraint on a value
in normal logic, we usually need two steps (literals): the first literal binds the value
to a variable and the other restricts it, e.g. length(Obj, Len) ∧ Len < 3. In fuzzy
logic, on the other hand, we can describe this with a single literal, e.g.
small-length(Obj). The important point is that in normal logic we need a variable both
to write a condition and to express a variation of it, e.g. Len < 5, whereas in FLP we
do not need any variable for this. We can use linguistic hedges, as in
very small-length(Obj) and more-or-less small-length(Obj), for variations of the
condition. Of course we can prepare a predicate small-length(Obj) in normal logic, but
the condition imposed by this predicate cannot be modified.

Not having to introduce variables to deal with values keeps a hypothesis clause short
and also keeps the number of variables small. Literals that introduce variables for
values are gain-less literals in many cases and are hard to handle. Consequently, this
is a large advantage for ILP.

Other attempts to extend ILP with fuzzy characteristics include [12], in which
a system is developed and applied to a control problem. It introduces linguistic
variables, which are an extension of normal variables and express vagueness. We
suspect that linguistic variables spoil the advantage of FLP. Our system expresses
vagueness by giving membership. We have already emphasized that explicit use of
variables is unnecessary in our system.
Abstract. One of the most important problems in rule induction methods
is that the measures used for rule search are influenced by missing
values. In this paper, a new approach to missing values is introduced,
called rough estimation of conditional probabilities. This technique uses
three estimation strategies: the ground mean, lower and upper methods.
Attributes which have missing values are estimated by these methods
and checked against the constraints for probabilistic rules. The proposed
method was evaluated on medical databases, and the experimental results
show that the induced rules correctly represented experts’ knowledge.



1 Introduction 

Rule induction methods have been introduced in order to extract and discover
useful patterns from databases [4,5] by using heuristic search methods in which
several kinds of measures or indices based on frequencies are used. One of the
important problems is that these measures are influenced by missing values, although
this effect is smaller than in the induction of decision trees [2]. In this paper, a
new approach to missing values is introduced, called rough estimation of conditional
probabilities. This technique uses three estimation strategies: the ground mean, lower
and upper estimation methods. The ground mean method assumes that the set of examples
which have missing values reflects the frequency observed in the set of examples
which do not have missing values. The lower estimation method assumes that none of
the cases having missing values belongs to the target concept, and the upper
estimation method assumes that all the cases with missing values belong
to the target concept. Attributes with missing values are estimated by these
methods, and the estimated accuracy and coverage are used for induction of
probabilistic rules.

The proposed method was evaluated on medical databases, the experimental 
results of which show that induced rules correctly represented experts’ knowledge 
and several interesting patterns were discovered. 

2 Rough Set Theory and Probabilistic Rules 

In the following sections, we use the following notation of rough set theory [5].
First, a combination of attribute-value pairs, corresponding to a complex in AQ
terminology [4], is denoted by a formula $R$. Secondly, the set of samples which
satisfy $R$ is denoted by $[x]_R$, corresponding to a star in AQ terminology. Finally,
$U$, which stands for “Universe”, denotes all training samples. Using this notation,
we can define several characteristics of classification in the set-theoretic
framework, such as classification accuracy and coverage. Classification accuracy and
coverage (true positive rate) are defined as:

$\alpha_R(D) = \frac{|[x]_R \cap D|}{|[x]_R|} \; (= P(D|R))$, and

$\kappa_R(D) = \frac{|[x]_R \cap D|}{|D|} \; (= P(R|D))$,

where $|A|$ denotes the cardinality of a set $A$, $\alpha_R(D)$ denotes the classification
accuracy of $R$ as to classification of $D$, and $\kappa_R(D)$ denotes the coverage, or
true positive rate, of $R$ to $D$.^1 It is notable that these two measures
are equal to conditional probabilities: accuracy is the probability of $D$ under the
condition of $R$, and coverage is that of $R$ under the condition of $D$.
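The two measures can be computed directly from their set-theoretic definitions. The following is a minimal sketch; the data layout (a list of attribute dictionaries with class labels) and the toy samples are hypothetical and are not taken from the medical databases used in this paper.

```python
def accuracy_and_coverage(samples, rule, target):
    """Return (accuracy, coverage) of `rule` for the class `target`.

    samples: list of (attributes, label) pairs, i.e. the universe U
    rule:    dict attribute -> required value, i.e. the formula R
    """
    # [x]_R: labels of the samples that satisfy every attribute-value pair of R
    covered = [label for attrs, label in samples
               if all(attrs.get(a) == v for a, v in rule.items())]
    hits = sum(1 for label in covered if label == target)           # |[x]_R ∩ D|
    n_target = sum(1 for _, label in samples if label == target)    # |D|
    accuracy = hits / len(covered) if covered else 0.0
    coverage = hits / n_target if n_target else 0.0
    return accuracy, coverage


# Hypothetical toy universe
U = [
    ({"nausea": "yes"}, "migraine"),
    ({"nausea": "yes"}, "migraine"),
    ({"nausea": "yes"}, "tension"),
    ({"nausea": "no"}, "migraine"),
    ({"nausea": "no"}, "tension"),
]

if __name__ == "__main__":
    print(accuracy_and_coverage(U, {"nausea": "yes"}, "migraine"))
```

Here three samples satisfy the rule and two of them belong to the target class, which itself has three members, so both measures equal 2/3.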



Probabilistic Rules 

According to the definitions, probabilistic rules with high accuracy and coverage
are defined as:

$R \xrightarrow{\alpha,\kappa} d$ s.t. $R = \vee_i R_i = \vee \wedge_j [a_j = v_k]$,

$\alpha_{R_i}(D) > \delta_\alpha$ and $\kappa_{R_i}(D) > \delta_\kappa$,

where $\delta_\alpha$ and $\delta_\kappa$ denote given thresholds for accuracy and coverage, respectively.



3 Estimation of Accuracy and Coverage 

3.1 Problems about Missing Values 

According to the definitions of accuracy and coverage, it is easy to see that both
measures are influenced by missing values, although coverage is more influenced
than accuracy. Whereas accuracy is a function of $[x]_R$ and $[x]_R \cap D$, coverage is
a function of $[x]_R \cap D$. If an attribute in $R$ has a missing value, then $[x]_R \cap D$
will be smaller than in the case when $R$ has none. In the case of accuracy, this
effect will be smaller because $[x]_R \cap D$ is divided by $[x]_R$. However, coverage will
suffer from this effect because the intersection is divided by $D$.

3.2 Missing Value Problems from the Viewpoint of Sets 

Let $R$ and $M$ denote an elementary attribute-value pair $[a_i = v_j]$ and the set of
examples missing the value of $a_i$, respectively. Also, let $M_D$ denote the set of
examples which belong to a class $D$ and which also belong to $M$ (i.e., $D \cap M$). For
simplicity, we consider $a_i$ to be a categorical variable, although the discussion below
can be extended to continuous variables. Then, ordinary accuracy and coverage are
defined as:

$\alpha_R(D) = \frac{|[x]_R \cap (D - M_D)|}{|[x]_R|} = \frac{|[x]_R \cap (D \cap (U - M))|}{|[x]_R \cap (U - M) + [x]_R \cap M|}$,

$\kappa_R(D) = \frac{|[x]_R \cap (D - M_D)|}{|D|} = \frac{|[x]_R \cap (D \cap (U - M))|}{|D \cap (U - M) + D \cap M|}$.

Estimation methods for accuracy and coverage can be viewed as interpretations
of $M_D$ under some assumptions on $D$.

^1 These measures are equivalent to confidence and support defined by Agrawal [1].

3.3 Ground Mean Measures 

A simple assumption, based on statistical methods, is that $M_D$ reflects the
frequency observed in $U - M$. For example, if the frequency of $D$ in $U - M$ is 50%,
then the frequency of $D$ in $M$ can be assumed to be 50%. In this case, the
formulae above are modified into:

$\alpha_R(D) = \frac{|[x]_R \cap (D - M_D)|}{|[x]_R|} = \frac{|[x]_R \cap (D \cap (U - M))|}{|[x]_R|}$,

$\kappa_R(D) = \frac{|[x]_R \cap (D - M_D)|}{|D - M|} = \frac{|[x]_R \cap (D \cap (U - M))|}{|D \cap (U - M)|}$.

This estimation will be referred to as the ground coverage.



3.4 Lower Measures 



Another assumption is that none of the cases having missing values belongs to
the target concept, which is the most pessimistic strategy. The two measures modified
by this assumption, called lower accuracy and lower coverage, are defined as:

$\underline{\alpha}_R(D) = \frac{|[x]_R \cap (D - M_D)|}{|[x]_R|} = \frac{|[x]_R \cap (D \cap (U - M))|}{|[x]_R|}$,

$\underline{\kappa}_R(D) = \frac{|[x]_R \cap (D - M_D)|}{|D|} = \frac{|[x]_R \cap (D \cap (U - M))|}{|D|}$.



3.5 Upper Measures 

The final assumption is that all the cases with missing values belong to the
target concept, which is the most optimistic strategy. The accuracy and coverage
modified by this assumption, called upper accuracy and upper coverage, are defined as:

$\overline{\alpha}_R(D) = \frac{|[x]_R \cap (D - M_D)| + |M_R|}{|[x]_R|} = \frac{|[x]_R \cap (D \cap (U - M))| + |M_R|}{|[x]_R|}$,

$\overline{\kappa}_R(D) = \frac{|[x]_R \cap (D - M_D)| + |M_R|}{|D|} = \frac{|[x]_R \cap (D \cap (U - M))| + |M_R|}{|D|}$.
These results are from univariate analysis, but it is notable that, since the
definitions above are based on an assumption about the relation between $M$ and $D$, the
extension of the estimation to multivariate missing values is very easy: the
corresponding strategy is applied to the other attributes which have missing values.
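The three coverage estimates for a single attribute can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this paper: the function name, the use of `None` as the missing-value marker and the toy data are ours, and for the upper estimate we take the added cases to be the members of $D$ whose value is missing, which is our reading of the optimistic strategy.

```python
MISSING = None  # marker for a missing attribute value (a convention of this sketch)


def coverage_estimates(samples, attribute, value, target):
    """Ground mean, lower and upper coverage of R = [attribute = value] for `target`."""
    d_known = d_missing = hits = 0
    for attrs, label in samples:
        if label != target:
            continue
        if attrs.get(attribute) is MISSING:
            d_missing += 1                 # member of D ∩ M
        else:
            d_known += 1                   # member of D ∩ (U − M)
            if attrs[attribute] == value:
                hits += 1                  # member of [x]_R ∩ D ∩ (U − M)
    d_total = d_known + d_missing          # |D|
    if d_total == 0:
        return None
    return {
        # Missing cases are ignored, i.e. assumed to follow the observed frequency
        "ground": hits / d_known if d_known else 0.0,
        # Pessimistic: no missing case is counted as supporting R
        "lower": hits / d_total,
        # Optimistic (assumption of this sketch): every missing case in D supports R
        "upper": (hits + d_missing) / d_total,
    }


# Hypothetical toy data: four target cases, one of them with a missing value
DATA = [
    ({"nausea": "yes"}, "migraine"),
    ({"nausea": "yes"}, "migraine"),
    ({"nausea": "no"}, "migraine"),
    ({"nausea": MISSING}, "migraine"),
    ({"nausea": "yes"}, "tension"),
]

if __name__ == "__main__":
    print(coverage_estimates(DATA, "nausea", "yes", "migraine"))
```

In this sketch the lower estimate never exceeds the ground mean estimate, which in turn never exceeds the upper estimate, so the three values bracket the coverage that would be observed if the missing values were known.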

4 Rule Induction 

According to the definition of probabilistic rules and the characteristics of
accuracy and coverage, an algorithm for induction of classification rules is given in
Figure 1.



procedure Induction of Probabilistic Rules;
var
  i : integer; M, L_i : List;
begin
  L_1 := L_er;   /* L_er: List of Elementary Relations */
  i := 1; M := {};
  for i := 1 to n do   /* n: Total number of attributes */
  begin
    while ( L_i ≠ {} ) do
    begin
      Calculate and Estimate Accuracy and Coverage;
      Sort L_i with respect to the value of coverage;
      Select one pair R = ∧[a_i = v_j] from L_i, which has the largest value of coverage;
      L_i := L_i − {R};
      if (κ_R(D) > δ_κ)
        then do
          if (α_R(D) > δ_α)
            then do S_ir := S_ir + {R};   /* Include R as Classification Rule */
          M := M + {R};
    end
    L_{i+1} := (A list of the whole combination of the conjunction formulae in M);
  end
end {Induction of Probabilistic Rules};

Fig. 1. An Algorithm for Probabilistic Rules 
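The level-wise search of Fig. 1 can be sketched in Python as follows. This is a simplified illustration, not PRIMEROSE-REX3: it uses the plain accuracy and coverage without the missing-value estimates, it assumes that only formulae which pass the coverage threshold but fail the accuracy threshold are kept in M for further conjunction, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def measures(samples, rule, target):
    """Accuracy and coverage of a conjunction `rule`, a set of (attribute, value) pairs."""
    covered = [label for attrs, label in samples
               if all(attrs.get(a) == v for a, v in rule)]
    hits = sum(label == target for label in covered)
    n_target = sum(label == target for _, label in samples)
    accuracy = hits / len(covered) if covered else 0.0
    coverage = hits / n_target if n_target else 0.0
    return accuracy, coverage


def induce_rules(samples, target, delta_alpha, delta_kappa):
    """Level-wise search: elementary pairs first, then conjunctions of the kept ones."""
    n_attributes = len({a for attrs, _ in samples for a in attrs})
    # L_1: every elementary attribute-value pair seen in the data
    level = {frozenset([(a, v)]) for attrs, _ in samples for a, v in attrs.items()}
    rules = []
    for size in range(1, n_attributes + 1):
        scored = [(measures(samples, r, target), r) for r in sorted(level, key=sorted)]
        scored.sort(key=lambda item: -item[0][1])          # largest coverage first
        kept = []                                          # the list M of Fig. 1
        for (accuracy, coverage), rule in scored:
            if coverage > delta_kappa:
                if accuracy > delta_alpha:
                    rules.append((sorted(rule), accuracy, coverage))
                else:
                    kept.append(rule)                      # assumption: refine non-rules only
        # L_{i+1}: conjunctions of kept formulae with one value per attribute
        level = set()
        for r1, r2 in combinations(kept, 2):
            merged = r1 | r2
            if len(merged) == size + 1 and len({a for a, _ in merged}) == size + 1:
                level.add(merged)
        if not level:
            break
    return rules


# Hypothetical toy data with two attributes
SAMPLES = [
    ({"a": "1", "b": "1"}, "pos"),
    ({"a": "1", "b": "1"}, "pos"),
    ({"a": "1", "b": "0"}, "neg"),
    ({"a": "0", "b": "1"}, "neg"),
    ({"a": "0", "b": "0"}, "neg"),
]

if __name__ == "__main__":
    for rule in induce_rules(SAMPLES, "pos", delta_alpha=0.9, delta_kappa=0.5):
        print(rule)
```

On the toy data neither elementary pair is accurate enough on its own, so both are kept and their conjunction is tested at the next level, where it is accepted as a rule.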



5 Experimental Results and Discussion 

5.1 Performance of Rules Obtained 

For experimental evaluation, a new system, called PRIMEROSE-REX3 (Probabilistic
Rule Induction Method for Rules of Expert System ver 3.0), was developed, in which
the algorithm discussed in Section 4 is implemented. PRIMEROSE-REX3 was applied to
the following medical domain: headache (RHINOS domain), whose training samples
consist of 2175 samples, 10 classes and 41 attributes (5 attributes have missing
values). The experiments were performed by the following three procedures. First,
these samples were randomly split into new training samples and new test samples.
Second, using the new training samples, PRIMEROSE-REX3 induced rules with the three
estimation procedures. Third, the induced results were tested on the new test
samples. These procedures were repeated 100 times, and all the estimators were
averaged over the 100 trials.

Experimental results are shown in Table 1. These results suggest that this
new approach performs as well as experts’ rules.



Table 1. Experimental Results (Accuracy: Averaged)

Method        Headache   GVD     Meningitis
Lower         78.1%      72.9%   81.7%
Ground Mean   89.3%      73.3%   75.5%
Upper         86.6%      75.4%   77.2%
Experts       91.7%      83.9%   89.2%
Abstract. This paper aims to expose to users the sustainability information
concealed in the UNDP annual human development database. In contrast to
“understanding database by computer”, the task is to tell the user what the
database means, and is called “understanding database by human”. The
discernibility matrix (DM) in rough sets is used to obtain a qualitative
description of the database. Domain knowledge is embedded in the DM, and its set
of reduction attributes is employed to develop the sustainable development
indicators (SDI) and to apply them to the assessment of the world’s sustainability.



1 Introduction

The data in the UNDP’s human development database (HDD) have been investigated,
collected and analyzed annually by thousands of people all over the world since
1990 [1][2]. The database includes hundreds of kinds of social, economic and natural
data. It is used to compare the human sustainability of different countries through
the human development indicators (HDI) induced from HDD.

However, the information abstracted from the HDD for the construction of HDI is
only concerned with life expectancy, education and GDP, while plenty of useful data
about population, resources, environment and system context has not been used. The
sustainability of a country is connected not only with population quality and
economic level, but also with social and economic structure, resource and
environmental state, and economic and ecological function. This paper aims to
excavate additional information from HDD for sustainability assessment, and to refine
a new sustainability indicator after further reduction, one which is more intuitive,
easier to explain and more operable.
The task is to refine a set of reduction attributes from HDD so that domain experts
can understand the database. In contrast to the method based on machine learning,
called “database understanding by computer” in this paper, it transforms the plain
database language into a language comprehensible to humans according to human
demand, which is called “database understanding by human” [9]. It is one of the most
significant tasks in KDD. The task can be divided into two related steps. The first
is to find the attribute reduction with domain knowledge. The second is to develop a
sustainable development indicators (SDI) function and give its explanation according
to the reduction attribute set [4].

Besides the ordinary statistical methods for data analysis, we need some a priori
expertise to identify the key factors of different countries’ SDI. According to the
characteristics of sustainability study, the discernibility matrix (DM) [8] in rough
sets (RS) [6] is employed to obtain a qualitative description of the database [7].
This is, firstly, because human development is a relative concept that is determined
only by the development discrepancy among different countries, and the DM method
describes exactly this discrepancy. Secondly, in order to evaluate different
countries’ SDI, we need only choose a representative set from the mass of discrepancy
data, which is in accordance with the DM’s reduction principles. This paper presents
a modified DM with domain knowledge on SDI and uses it to solve the problems of data
mining of HDD.



2 HDD Data and Pre-processing 

HDD includes more than 300 attributes and the data of 175 countries from 1990 to
1997, which can be classified as social, economic and environmental data, or as
structural and functional data. The first task is to reduce them to a succinct format
so that users can easily understand and compare the human development state of
different countries.

Because of the roughness, unevenness and incompleteness of the database, it is
necessary to pre-process irregular and vacant data. The internal renewable water
resource per capita, for example, is below 20 for most countries, while in Iceland
it is 624.5, 50 times higher than the average. Using the real value directly would
wipe out the differentiation among most countries’ water resources. A threshold is
therefore set by experts to adjust the scaling of each such irregular datum in order
to enlarge the differentiation.

The simplified attribute set is not enough for a better understanding of HDD. We
need a new kind of language for refining the sustainability indicator, one which
should be more understandable by common sense. An appropriate variation of the
simplified attribute set has been carried out. For each of the nine attributes, we
created a new SDI attribute table with the data of different years and countries, and
made statistical analyses and time series analyses for each table. Special attention
was given to the irregular data, and a revised attribute table was obtained after
irregular data filtering, error correction and vacancy substitution. Based on the
revised attribute table, each year’s revised HDD was reassembled. Finally, the annual
SDI value of each country was calculated and ranked based on the revised HDD. The
average SDI of every country in the 1990s and its tendency were obtained through
statistical and regression analyses.
The limiting factors and promoting factors of each country’s human development
can be identified according to whether the SDI value is less than 0.15 or larger
than 0.85. Based on these factors, a risk and opportunity assessment can be carried
out for each country. The tendency of each country’s SDI can be identified from the
results of a regression analysis of each SDI during 1991-1997. When the slope is
larger than +0.15%, the SDI is ascending; when the slope is less than -0.15%, the SDI
is descending; and when the slope is between -0.15% and +0.15%, the SDI is stabilizing.
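These thresholds can be illustrated with a short Python sketch. It is not the authors' code: the function names, the label "neutral" for values between the two factor thresholds, the interpretation of 0.15% as a slope of 0.0015 SDI units per year, and the example series are all assumptions of ours.

```python
def slope(years, values):
    """Ordinary least-squares slope of `values` regressed on `years`."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


def classify_trend(years, sdi, band=0.0015):
    """Label an SDI series as ascending, descending or stabilizing."""
    s = slope(years, sdi)
    if s > band:
        return "ascending"
    if s < -band:
        return "descending"
    return "stabilizing"


def classify_factor(value, low=0.15, high=0.85):
    """Label a single SDI value as a limiting or promoting factor."""
    if value < low:
        return "limiting"
    if value > high:
        return "promoting"
    return "neutral"


YEARS = list(range(1991, 1998))

if __name__ == "__main__":
    rising = [0.40 + 0.01 * i for i in range(7)]   # hypothetical SDI series
    print(classify_trend(YEARS, rising), classify_factor(0.10))
```

The width of the stabilizing band is passed as a parameter so that a different interpretation of the percentage threshold can be tried without changing the code.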

3 A Modified DM Principle with Domain Knowledge 

There are two steps in DM: finding the DM and reduction. Domain knowledge is needed
in both. As rough sets produce too fine a division in the case of continuous
variables, which affects the feasibility of reduction, the continuous variable set is
first mapped to a discrete one as permitted by domain knowledge. In addition, because
sustainability is a systematic concept concerned with efficiency, equity and vitality
in all aspects of social, economic and ecological development, the final result of
HDD reduction should cover all of these areas [3].

According to RS [7], the table can be defined as $S = \langle U, A, V, f \rangle$, where
$U$ is the universe, $A$ is the set of attributes on $U$, $V = \bigcup V_a$, where $V_a$ is the
value set of the attribute $a$, and $f$ is an information function. To embed domain
knowledge in the DM, its item is defined as
$A_{ij} = \{a : h_a(i, v_a) \neq h_a(j, v_a), \forall a, a \in A\}$, where $h_a(\cdot)$ is called the
human development function, defined as a mapping $R \rightarrow \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$.
A measurement of attribute significance is used in attribute reduction. The
roughness-based attribute significance is not a good index for discriminating between
attributes when dealing with a non-conflicting and non-overlapping database such as
HDD. We define the attribute significance $\gamma(a)$ as the ratio of the attribute
appearance frequency $Frq(a)$ [5][6] to the total number $K$ of items in the DM:
$\gamma(a) = Frq(a)/K$, $a = ATTR(\max\{\gamma(x), \forall x, x \in A\})$, where $a$ is the most
important attribute. It is selected as a reduction attribute.

Obviously, this frequency can be found directly from $U/a$ or acquired by counting
the number of times the attribute appears in the DM. The attribute with the maximum
value becomes a reduction attribute. However, this is not enough: the domain requires
keeping a balance among the social, economic and environmental aspects, so an
alternative strategy is developed as follows:

Let $A = \{S, P, E\}$, where $S$, $P$ and $E$ represent the social, economic and
environmental attribute subsets, respectively. During reduction, the most important
attributes $s$, $p$, $e$ of $S$, $P$, $E$ respectively are selected as reduction
attributes. The HDD attribute reduction algorithm can be described as:

1. $M$ is a DM of HDD, and $Co$ is the set of core attributes in $M$. Let $R = Co$.

2. $Q = \{A_{ij} : A_{ij} \cap R \neq \emptyset, i \neq j, i, j = 1, 2, \ldots, n\}$, $M = M - Q$, $B = A - R$;

3. For all $s_k, p_k, e_k \in B$ in $S$, $P$, $E$ respectively, find $\gamma(s_k)$, $\gamma(p_k)$, $\gamma(e_k)$ and let
$\gamma(s_q) = \max\{\gamma(s_k)\}$, $\gamma(p_q) = \max\{\gamma(p_k)\}$, $\gamma(e_q) = \max\{\gamma(e_k)\}$;

4. $R \Leftarrow R \cup \{s_q, p_q, e_q\}$;

5. Repeat the above steps until $M = \emptyset$.

The set $R$ is a set of reduction attributes of the above information table $S$ (HDD).
Because there was no core in HDD, the initial $R$ is empty.
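The five steps above can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the function names, the toy table of three countries, the grouping of attributes and the discretization used as human development function are hypothetical, and ties between equally frequent attributes are broken alphabetically.

```python
def discernibility_items(table, h):
    """Items A_ij: attributes whose discretized values differ between objects i and j."""
    names = list(table)
    items = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            row_i, row_j = table[names[i]], table[names[j]]
            diff = frozenset(a for a in row_i if h(a, row_i[a]) != h(a, row_j[a]))
            if diff:
                items.append(diff)
    return items


def reduce_attributes(items, groups):
    """Balanced reduction; `groups` maps a group name to its set of attributes."""
    # Step 1: the core consists of the attributes that form an item on their own
    reduct = {next(iter(item)) for item in items if len(item) == 1}
    remaining = list(items)
    while True:
        # Step 2: drop the items already discerned by the current reduct
        remaining = [item for item in remaining if not (item & reduct)]
        if not remaining:                      # Step 5: stop when M is empty
            return reduct
        added = False
        # Step 3: the most frequent remaining attribute of each group
        # (frequency divided by the constant K gives the same ordering)
        for members in groups.values():
            counts = {a: sum(a in item for item in remaining) for a in members - reduct}
            counts = {a: c for a, c in counts.items() if c > 0}
            if counts:
                reduct.add(max(sorted(counts), key=counts.get))   # Step 4
                added = True
        if not added:
            raise ValueError("some remaining attribute belongs to no group")


def to_level(attribute, value):
    """Hypothetical human development function: percentages mapped to levels 0..9."""
    return min(9, value // 10)


# Hypothetical toy table (values in percent)
TABLE = {
    "X": {"life": 82, "gdp": 75, "water": 31},
    "Y": {"life": 45, "gdp": 72, "water": 35},
    "Z": {"life": 41, "gdp": 20, "water": 90},
}
GROUPS = {"S": {"life"}, "P": {"gdp"}, "E": {"water"}}

if __name__ == "__main__":
    dm_items = discernibility_items(TABLE, to_level)
    print(sorted(reduce_attributes(dm_items, GROUPS)))
```

In the toy table, X and Y differ only in the discretized life attribute, which therefore forms the core; the remaining item is then covered by taking one attribute from each of the other two groups.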
4 Formulating SDI from HDD Data Mining

One objective of HDD data mining is to identify a more succinct attribute set, SDI.
The index system would not be rational or applicable if the reduction process were
carried out independently year by year. To make efficient use of every year’s HDD
information, we first reduce the attributes year by year, and then select as
candidate indices for SDI those common attributes that appeared in most years’
results. According to the DM principles, these must be the attributes in HDD that
best represent the countries’ characteristics, because the number of discrepancy
items in the DM containing them is the largest.

However, the countries’ discrepancy in certain attributes alone cannot appropriately
determine the SDI index system. Some key attributes may not be chosen in a process
that is independent of domain knowledge. The appearance frequency of the GDP
attribute, for example, is only 50% and does not meet our reduction rules, though it
is a very important attribute in constructing SDI. According to experts’ knowledge,
some of the most important attributes were labeled fade kernels and have priority to
enter the final reduction. In our work, life expectancy at birth and GDP per capita
were the fade kernels suggested by experts and entered into the reduction result.

5 Application of SDI and Database Understanding by Human

The visible results of the sustainability knowledge mining from HDD are: 

1. Through statistical analyses, we obtained the SDI tendency of different countries
in the world. The average value of SDI all over the world is 0.45 for 1991-1997. The
SDI value of the strongest country is 5 times higher than that of the weakest country.

2. The regional SDI ranking from high to low is: North America, West Europe, East
Europe, South America, Middle America, Pacific and Oceanic, East Asia, West
Asia, South Asia and Africa. The highest SDI is in North America (0.6274) and the
lowest is in Africa (0.3007), less than half that of North America.

3. The SDI values in 29% of the countries of the world are stronger. The three
strongest countries are Canada, Sweden and Norway, with SDI values of 0.740, 0.711
and 0.708 respectively.

4. The SDI values in 42% of the countries of the world are at a middle level. Of these
countries, 52% are above the world's average, while 48% are below it.

5. The SDI values in 29% of the countries of the world are weaker. 82% of these
countries are located in Africa. The three weakest countries are Afghanistan, Somalia
and Mali, with SDI values of 0.117, 0.128 and 0.152 respectively. They are all less
developed countries with poor resources and heavy social and natural disasters.

6. In the past eight years, the SDI value has been ascending in 58.6% of the countries
of the world, relatively stable in 16.1%, and descending in 25.3%, while the SDI
values of all former-USSR countries are declining.

7. The SDI values in 75% of the less sustainable developing countries (below
average) are ascending, 12% are stabilizing and only 13% are descending.

8. The above results were calculated from UNDP’s HDD (1991-1997), which is based on
data before 1994. The conclusions therefore only reflect the world’s condition before 1994.
6 Conclusion

This paper exposes to users, via KDD, the sustainability information concealed in the
UNDP annual Human Development Database. As users are mainly concerned with the
dynamics and differences of various countries’ development, and the experts’
knowledge is critical in refining the sustainability index, a modified discernibility
matrix (DM) with domain knowledge is developed and used to acquire a succinct
attribute reduction.

The results of data mining for HDD are as follows:

1. Expansion of HDI into SDI, which includes more sustainability information from
the same database through KDD.

2. Mining of dynamic information from different years’ HDD, and ascertainment of the
SDI change tendency.

3. Identification of limiting factors and promoting factors in each country.

4. Identification of some data errors and uncertainties in some countries’ data.

5. Comparison of the sustainability of different countries and regions in the world.

6. Provision of a spatial and temporal map of the world’s sustainability based on the
HDD data from 1991 to 1997.
Abstract. The “Ripple Down Rules (RDR)” method is one of the promising
approaches to directly acquiring and encoding knowledge from human
experts. It requires data to be supplied incrementally to the knowledge
base being constructed, and each new piece of knowledge is added as an
exception to the existing knowledge. Because of this patching principle,
the knowledge acquired strongly depends on what is given as the default
knowledge. Further, data are often noisy and we want RDR to be noise
resistant. This paper reports experimental results on the effect of the
selection of default knowledge and of the amount of noise in data on the
performance of RDR, using a simulated expert. The best default knowledge
is characterized as the class knowledge that maximizes the minimum
description length to encode rules and misclassified cases. This criterion
also holds even when the data are noisy.



1 Introduction 

An implicit assumption in the development of knowledge-based systems is that
expert knowledge is stable and that it is worth investing heavily in knowledge
acquisition from human experts. However, the rapid innovation of technology
in recent years makes existing knowledge go out of date quickly and calls for
frequent updates. In other words, we should now regard expert knowledge as
dynamic rather than static in real world problems. In addition, the
advancement of worldwide networks such as the Internet has changed the practice
of computer usage drastically. It is now possible for many users to access a
single knowledge-based system through a network [9], and further, the need
arises for multiple experts to supply new knowledge continuously through the
same network.

Under these circumstances, a new methodology for constructing knowledge-based
systems that cope with continuous change is in great demand. Knowledge is
value-added data and includes such things as experience, insight and skill.
“(Multiple Class) Ripple Down Rules ((MC)RDR)” [1,8]^1 is one of the promising
approaches to realizing such knowledge-based systems and to directly acquiring and
encoding knowledge from human experts. (MC)RDR is a performance system that does
not require high level models of knowledge as a prerequisite to knowledge
acquisition, and has been shown to be very effective in knowledge maintenance as
well as in the acquisition of reflexive knowledge. Recent advancement shows that
MCRDR can be applied to a configuration task [15]. Handling multiple sources by
MCRDR has not yet been resolved.

^1 In this paper, we limit our analysis to RDR.

N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 284-295, 1999.
(c) Springer-Verlag Berlin Heidelberg 1999

RDR relies exclusively on human experts for the knowledge to be acquired, and one
of its advantages is that it does not require any data statistics as machine
learning techniques do. Use of such statistics, if available, may enhance RDR’s
performance. Default knowledge is one kind of knowledge that requires statistics
to characterize it. In this paper we try to identify what characterizes good
default knowledge when the available data are noisy, and show that the selection of
good default knowledge contributes to building a noise-resistant, compact
knowledge base.



2 Ripple Down Rules Revisited 

Ripple Down Rules (RDR) is a knowledge acquisition technique that challenges
the KA bottleneck problem by allowing rapid development of knowledge bases by
human experts without the need for analysis or intervention by a knowledge
engineer (KE). From long experience of knowledge-based systems development [1],
it became clear that human experts are not good at providing information on
how they reach their conclusions; rather, they can justify that their conclusions
are correct [2,4]. The basis of RDR is the maintenance and retrieval of cases. It
tries to use the historical way in which the expert provides his/her expertise to
justify the system’s judgements [7]. The cases and associated justifications (rules)
are added incrementally when a case is misclassified in the retrieval process. This
is similar to “failure-driven memory”, which was introduced by Schank [16]. Experts
are very good at justifying their conclusions about a case in terms of differences
from other cases [10]. When a case is incorrectly retrieved by an RDR system, the
knowledge acquisition (maintenance) process requires the expert to identify how
the retrieved case differs from the present case. This has a similar motivation to
work on personal construct psychology, where identifying differences is a key
strategy [6]. The notion of on-going refinement by differentiating cases is a useful
model of human episodic memory. Thus, RDR has reduced knowledge acquisition to a
simple task of assigning a conclusion to a case and choosing the differentiating
features between the current misclassified case and the previously correctly
classified case.

Each time a rule is added incrementally when a case is misclassified, the case
that prompted the rule is stored (called the “cornerstone case”). Depending
on whether the last satisfied node is the same as the end node, the new rule
and its cornerstone case are added at the end of the YES or NO branch of the end
node. Knowledge is never removed or changed, simply modified by the addition
of exception rules. The tree structure of the knowledge and the keeping of
cornerstone cases ensure that the knowledge is always used in the same context
as when it was added.
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Fig. 1. Knowledge structure of Ripple Down Rules 



The tree structure of the RDR knowledge base is shown in Figure 1(a). Each node in the binary tree is a rule with a desired conclusion. Each node has a cornerstone case associated with it, namely the case that prompted the inclusion of the rule. The classification (conclusion) comes from the last rule that was satisfied by the current data. The knowledge acquisition process in RDR is illustrated in Figure 1(b). To add a new rule into the tree structure, it is necessary to identify the location and the condition of the rule. When the expert wants to add a new rule, there must be a case that is misclassified by a rule which was applicable to the current case. The system asks the expert to select conditions from the difference list between these two cases: the misclassified case and the cornerstone case. Then the misclassified case is stored as the refinement case (new cornerstone case) with the new rule whose condition part distinguishes these cases.
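The structure and patching step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `Node`, `classify`, `difference_list` and `add_rule` are hypothetical, cases are assumed to be dictionaries of nominal attribute values, and a rule condition is assumed to be a simple equality test.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Case = Dict[str, str]  # attribute name -> nominal value (assumed representation)


@dataclass
class Node:
    """One rule of the RDR binary tree, stored with its cornerstone case."""
    conditions: Dict[str, str]        # empty dict = always satisfied (root)
    conclusion: str
    cornerstone: Optional[Case] = None
    yes: Optional["Node"] = None      # followed when this rule is satisfied
    no: Optional["Node"] = None       # followed when this rule is not satisfied

    def fires(self, case: Case) -> bool:
        return all(case.get(a) == v for a, v in self.conditions.items())


def classify(root: Node, case: Case) -> Tuple[Node, Node, bool]:
    """Return (last satisfied node, end node, whether the end node fired)."""
    node, last, end, fired = root, root, root, True
    while node is not None:
        end, fired = node, node.fires(case)
        if fired:
            last = node               # the conclusion comes from this node
        node = node.yes if fired else node.no
    return last, end, fired


def difference_list(case: Case, cornerstone: Optional[Case]) -> Dict[str, str]:
    """Attribute values of `case` that differ from the cornerstone case."""
    if cornerstone is None:
        return dict(case)
    return {a: v for a, v in case.items() if cornerstone.get(a) != v}


def add_rule(root: Node, case: Case, conclusion: str,
             conditions: Dict[str, str]) -> Node:
    """Attach a new rule (with `case` as cornerstone) at the end node."""
    _, end, fired = classify(root, case)
    new = Node(conditions, conclusion, cornerstone=dict(case))
    if fired:          # last satisfied node == end node -> YES branch
        end.yes = new
    else:              # otherwise -> NO branch
        end.no = new
    return new
```

A root created as `Node({}, "unacc")` plays the role of the default rule; nothing is ever deleted, and each correction only adds a node below the end node reached by the misclassified case.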



3 Default Knowledge in Ripple Down Rules 

The rule at the root node in RDR Knowledge Base, which is represented by a 
binary tree, is special. Its condition part is empty (it means that it is always 
satisfied) and its consequent part is a special conclusion called “default class” . 
Here, the default class means a tacit conclusion for the current case when no 
other conclusion of the case is derived from the knowledge base. 

Suppose that a conclusion A is set as the default class in RDR and a training 
case which has the conclusion A is supplied to the system. Even if no rule in the 
knowledge base fires for the case, the system correctly classifies the case using the 
default class A. Although the classification is correct, no explicit knowledge that 
characterizes the conclusion A is acquired. It is simply saying that since there is 
no conflicting knowledge, the conclusion must be A. Therefore, a different setting 
of the default class yields a different knowledge base with different performance 
(accuracy and size) even for the same problem domain.

The root node has only a YES branch (Figure 1(a) and (b)), while all the other nodes have both YES and NO branches. All rules below the root node are exception rules of the root node. More generally, all rules below the YES branch of a node are exception rules of that node. In previous research it was common to take the majority class in the domain as the default class [8]. The majority class is the class of the most frequent cases in the data set. This looks natural because we only need to acquire rules (knowledge) for the other (minority) classes as exceptions to the default class, plus some subsequent refinement rules for the default class. The knowledge base is therefore expected to be compact. However, being the majority class does not necessarily mean that the class is easy to describe and characterize. The same is true for non-majority classes.

It is worth investigating how differently an RDR system behaves in terms of accuracy and size when a different default class is chosen. There are as many possible settings of the default class as there are classes in the domain. We discuss why and how the performance of an RDR system changes, with reference to the knowledge characterization of an expert. In this study a simulated expert is used in place of a human expert [5].

Another important aspect which the study of RDR has so far not addressed much is the handling of noise in data. It is always assumed that a human expert is available and judges the quality of data. In real world problems, noise abounds and we cannot get rid of it. We have to admit that human experts make mistakes.

It is known that one of the merits of RDR is that it can acquire the majority of knowledge at an early stage of knowledge acquisition compared with a standard empirical machine learning approach [11]. The intervention of a human expert is the source of this high performance. How these good properties are affected by the introduction of noise and by the selection of the default class is the main focus of the current study.



4 Experiments 

4.1 Data Sets 

We have selected 15 data sets (7 data sets with all nominal attributes, 7 data sets 
with all numerical attributes and 1 data set with mixed attributes, see Table 1) 
from University of California Irvine Data Repository [12]. 

4.2 Experimental Method 

We have conducted parametric studies using a simulated expert [5] for the above data sets. For each default class used, we have evaluated the classification accuracy and knowledge base size with different noise levels and different training data sizes. The results are compared with respect to these parameters and with another machine learning method.

Table 1. Summary of 15 data sets

Name          Cases   Classes   Attributes
Car            1728      4      Nom.* 6
Tic-Tac-Toe     958      2      Nom. 9
Nursery       12960      5      Nom. 8
Connect-4     16889      3      Nom. 42
Mushroom       8124      2      Nom. 22
Monk           1711      2      Nom. 6
Splice         3190      3      Nom. 60
Pendigits     10992     10      Num.** 16
Iris            150      3      Num. 4
Page-block     5473      5      Num. 10
Optdigits      5620     10      Num. 64
Yeast          1484     10      Num. 8
Waveform       5000      3      Num. 21
Image          2310      7      Num. 19
Ann-thyroid    7200      3      Mixed 15/6***

* nominal attributes, ** numerical attributes, *** 15 nominal / 6 numerical



Simulated expert As a simulated expert, we use C4.5 [13]. We run C4.5 with the default setting over the entire data for each noise level and for each data set. This is the maximum attainable performance that can be induced from the given data and we use this as the knowledge (expertise) of an expert. Here, we use the induction rule set which is generated from the decision tree derived by C4.5.

While a human expert checks the output of the RDR system and, when the conclusion is wrong, generates a new rule by selecting conditions from the difference list, a simulated expert similarly selects conditions from the difference list based on machine learning techniques. In this study, all conditions in the intersection of the C4.5 rule conditions for the case and the difference list for the case are selected. It should not be expected that the simulated expert will perform better than a real human expert. For noisy data, the simulated expert may misclassify some of the cases. Though a human expert may perform better than a simulated expert in this case, we adopt the classification by the simulated expert as the conclusion made by the expert. It is also noted that the simulated expert has a small error rate even when the data is noise-free. This is because pruning is automatically performed in C4.5.
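The condition selection of the simulated expert can be illustrated as a set intersection. The sketch below is an assumption-laden illustration, not the authors' code: `select_conditions` is a hypothetical helper, conditions are assumed to be attribute/value equality tests on nominal attributes, and both inputs are assumed to be plain dictionaries.

```python
def select_conditions(difference_list, c45_rule_conditions):
    """Keep the difference-list entries that also appear as conditions
    of the C4.5 rule covering the case (assumed equality tests)."""
    return {attr: value
            for attr, value in difference_list.items()
            if c45_rule_conditions.get(attr) == value}


# Hypothetical example: the case differs from the cornerstone case in two
# attributes, but only one of them is used by the covering C4.5 rule.
diff = {"price": "vhigh", "doors": "2"}
rule = {"price": "vhigh", "safety": "high"}
selected = select_conditions(diff, rule)
```

With numerical attributes the conditions would be threshold tests rather than equalities, which this sketch does not cover.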

Data preparation Since RDR accepts data sequentially, the order of incoming data affects the performance. The whole data for each domain was reordered by random sampling, and ten differently ordered data sets were generated to cancel out this ordering effect. For each data set of a fixed order, the first 75% of the data is taken as training data and the remaining 25% of the data as test data. To see how rapidly RDR can acquire the correct knowledge, the predicted error rate on the test data is periodically evaluated at the points where 1%, 2%, 4%, 5%, 7%, 10%, 15%, 20%, 30%, 45%, 50%, 60% and 75% (all) of the data are used.
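The data preparation just described can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the seed, and the use of integer truncation when converting percentages to case counts are all illustrative choices that the text does not specify.

```python
import random

# Percentages of the whole data at which the test error is evaluated.
CHECKPOINTS_PCT = [1, 2, 4, 5, 7, 10, 15, 20, 30, 45, 50, 60, 75]


def make_orderings(cases, n_orderings=10, seed=0):
    """Return n_orderings independently shuffled copies of the data."""
    rng = random.Random(seed)
    orderings = []
    for _ in range(n_orderings):
        shuffled = list(cases)
        rng.shuffle(shuffled)
        orderings.append(shuffled)
    return orderings


def split_with_checkpoints(ordered_cases):
    """First 75% for training, last 25% for testing, plus training prefixes."""
    n = len(ordered_cases)
    n_train = n * 75 // 100
    train, test = ordered_cases[:n_train], ordered_cases[n_train:]
    prefixes = {p: ordered_cases[: n * p // 100] for p in CHECKPOINTS_PCT}
    return train, test, prefixes
```

The reported error at each checkpoint would then be the mean, over the ten orderings, of the error measured on the corresponding `test` part.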

Accuracy The predicted error rate is used as a measure of accuracy. It is the average of the error rates on the remaining 25% of the data over the ten data sets, each with a different ordering.

Noise handling Noise handling is different for nominal attributes and numerical attributes. For nominal attributes, the feature values were changed randomly to other alternatives with a specified fraction (noise level). For numerical attributes, Gaussian noise with standard deviation equal to the specified fraction (noise level) of each attribute's value range was randomly added. The noise was added only to the attribute values and not to the class information, with a fraction of 0% (without noise), 2%, 4%, 6%, 8%, 10%, 12%, 14%, 16%, 18% and 20%. The evaluation was performed at two points where 20% and 50% of the data were seen for both kinds of attributes. This is to see the effect of noise both at an earlier stage of knowledge acquisition where the system has not yet seen enough data and at a later stage where the system's performance is approaching an equilibrium.
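The two noise models can be sketched as below. Several details are assumptions made only for illustration: each nominal value is flipped independently with probability equal to the noise level (the text could also be read as corrupting an exact fraction of values), the replacement is drawn uniformly from the other values, and the function names are hypothetical.

```python
import random


def add_nominal_noise(values, alternatives, noise_level, rng):
    """With probability noise_level, replace a value by a different one."""
    noisy = []
    for v in values:
        if rng.random() < noise_level:
            others = [a for a in alternatives if a != v]
            noisy.append(rng.choice(others) if others else v)
        else:
            noisy.append(v)
    return noisy


def add_numeric_noise(values, noise_level, rng):
    """Add zero-mean Gaussian noise with sigma = noise_level * value range."""
    sigma = noise_level * (max(values) - min(values))
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(0)
doors = ["2", "3", "4", "5more", "2", "4"]
noisy_doors = add_nominal_noise(doors, ["2", "3", "4", "5more"], 0.10, rng)
noisy_lengths = add_numeric_noise([4.9, 5.1, 6.3, 7.0], 0.10, rng)
```

Class labels are left untouched, in line with the statement that noise is added only to attribute values.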

Machine learning method The RDR knowledge base is compared with the knowledge base built using a standard machine learning method, C4.5 [13]. Since learning by C4.5 is not incremental, we run C4.5 at the same data points where we check the accuracy of the RDR knowledge base. This means that at an early stage, C4.5 must learn from a small data set. The test data set is the same as the one used for RDR testing (the last 25% of the data).

Size of knowledge base The size of the knowledge base is defined to be the number of decision nodes in the RDR binary tree and the number of induction rules in C4.5, respectively. This is because a decision node in RDR corresponds to an if-then rule.

4.3 Results 

Due to the space limitation, we only show the detailed results of Car data set 
out of the 15 data sets, and the summary results of the whole data sets. 

Table 2 summarizes the data characteristics and the main results. This data set has 6 nominal attributes and 4 classes, which are unacc (unacceptable), acc (acceptable), good and vgood (very good). The first three rows are 1) the number of cases for each class (a total of 1728 cases, with unacc as the majority class), 2) the number of induction rules for each class derived from the C4.5 decision tree, and 3) the error rate of the simulated expert C4.5 (as calculated by the command C4.5rules) for each class.

Figures 2 and 3 show the results of the machine learning method C4.5 and the four kinds of RDR, each with a different default class, when trained on various data sizes in the Car data set. The graphs respectively show the improving performance (error rate) and the increasing size of the knowledge base as more data become available. From Figure 2, it is seen that the learning speed of the RDRs is faster than that of C4.5. This is because RDR acquires the knowledge from the simulated expert, who has enough knowledge of all cases in the problem domain. This is consistent with the results given in [11].

Among the four kinds of RDR, the learning speed of RDR(acc), whose default class is acc, is fastest, and the size of the knowledge base of RDR(acc) is smallest, as can be seen in the corresponding learning curves. It should be noted that the RDR which has the majority class as the default class does not necessarily show the best performance. The majority class in this data set is unacc. This is also summarized in Table 2. Since it is not necessary to acquire the knowledge about the default class in RDR, RDR(acc) does not acquire the knowledge about the class acc but acquires the knowledge about the other classes, more precisely the knowledge to differentiate these classes from the class acc. The fifth, sixth and seventh rows show the ranking of the accuracy, the learning speed and the size of the knowledge base for each default class. Note that the error rate of the simulated expert is lowest and the number of rules required to describe the class is largest for the class acc, which ranks first. It means that RDR(acc) quickly acquires the knowledge about the other classes, which happen to have higher error rates and do not require a large description length. It does not acquire the explicit knowledge about the default class acc, which has a low error rate. The curves for the other defaults (unacc, good, vgood) are nearly the same, vgood being the worst. In Table 2 these are ranked either 2 or 3. The default class vgood has the highest error rate of the simulated expert and has the second smallest rule size to describe the class.



Table 2. Car Evaluation Database

Class                                 unacc     acc    good   vgood
Number of cases                        1210     384      69      65
Number of induction rules                11      53      15      15
Error rate of simulated expert (%)      2.7     1.3     7.7    15.4
Minimum description length (bits)     261.4   638.2   271.3   232.1
Accuracy (ranking*)                       2       1       3       3
Acquisition speed (ranking*)              3       1       2       3
Size of knowledge base (ranking*)         2       1       2       2

* Smaller is better. Both the accuracy and the size are evaluated at 75% of the total data seen. The learning speed is evaluated at 20% of the total data seen. When the difference is small, the rank is rated the same.



From this observation, we can conjecture that the more induction rules are required to describe the default class and the lower the error rate of the simulated expert for the default class is, the faster the learning speed of RDR is and the smaller the size of the RDR knowledge base is. By taking as the default a class that requires more induction rules to describe, we can avoid adding these many rules later in the knowledge acquisition process. Likewise, by taking a class for which the predicted error is smaller, the refinement rules for the default class can be more error free and the probability of recursive exceptions taking place would be lower.

Fig. 2. Error rate for various sized training sets in Car data set. RDR(unacc), for example, means RDR whose default class is unacc.

Fig. 3. Size of the knowledge base for various sized training sets in Car data set.



In order to quantify the above observation, we introduce a new index that considers both the number of rules and the error (misclassification) rate of the simulated expert for each default class, namely the “minimum description length”, shown in the fourth row of Table 2.

This index is calculated as the sum of the coding costs for induction rules and misclassified cases. The number of bits required to encode the induction rules is

$$\sum_{i=1}^{N} \left\{ \sum_{j=1}^{M_i} -\log_2(p_{ij}) - \log_2(M_i!) \right\} - \log_2(N!)$$

where $N$ is the number of rules, $M_i$ is the number of conditions in the rule $R_i$, and $p_{ij}$ is the probability that the j-th condition of the rule $R_i$ is satisfied over all training cases. The number of bits required to encode the misclassified cases is

$$\log_2 \binom{r}{fp} + \log_2 \binom{n-r}{fn}$$

where the rules cover $r$ of the $n$ training cases with $fp$ false positives and $fn$ false negatives. Here, we applied a weighting factor of 0.5 to the former because it is known that the rule coding length as defined in the above equation is overestimated. This value is the default setting of C4.5.
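As a worked illustration of this index, the sketch below computes the weighted description length from per-rule condition probabilities and exception counts. It rests on assumptions that should be read as such: the exception cost is taken to be the usual C4.5 form $\log_2\binom{r}{fp} + \log_2\binom{n-r}{fn}$, the rule cost follows the formula in the text with each $-\log_2 p_{ij}$ term summed over conditions, and all function names and the toy numbers are hypothetical.

```python
import math


def rule_bits(rule_condition_probs):
    """Bits to encode a rule set; one list of condition probabilities per rule."""
    total = 0.0
    for probs in rule_condition_probs:
        total += sum(-math.log2(p) for p in probs)
        total -= math.log2(math.factorial(len(probs)))   # condition order is free
    return total - math.log2(math.factorial(len(rule_condition_probs)))


def exception_bits(n, r, fp, fn):
    """Bits to identify fp false positives among r covered cases
    and fn false negatives among the n - r uncovered cases."""
    return math.log2(math.comb(r, fp)) + math.log2(math.comb(n - r, fn))


def description_length(rule_condition_probs, n, r, fp, fn, weight=0.5):
    """Weighted rule cost plus exception cost (weight 0.5 as in the text)."""
    return weight * rule_bits(rule_condition_probs) + exception_bits(n, r, fp, fn)


# Toy example: two rules, with condition probabilities (0.5, 0.25) and (0.5).
toy_rules = [[0.5, 0.25], [0.5]]
toy_length = description_length(toy_rules, n=10, r=4, fp=1, fn=0)
```

For the toy input the rule cost is 2 bits and the exception cost is 2 bits, so the weighted total is 0.5 × 2 + 2 = 3 bits.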

The total coding length for rules and exceptions is minimized for each class to find the best subset of rules (this is what C4.5 does when inducing rules from the decision tree). The result of Table 2 suggests that RDR performs best in terms of the speed of knowledge acquisition and the size of the knowledge base when the default is taken to be the class for which the minimum description length is maximum (heuristic principle of maximum minimum description length). It is also shown that RDR behaves worst in terms of its accuracy and size when the default is taken to be the class for which the minimum description length is minimum. This relation holds for noisy data as discussed below.

Fig. 4. Error rates vs. minimum description lengths of the four RDRs for three different noise levels (0%, 10% and 20%) at the point where the first 20% of data are seen in Car data set.

Fig. 5. Error rates vs. minimum description lengths of the four RDRs for three different noise levels (0%, 10% and 20%) at the point where the first 50% of data are seen in Car data set.

Figures 4 and 5 show the relation between the accuracies and the minimum description lengths of the four RDRs for three different noise levels (0%, 10% and 20%) at the points where the first 20% and 50% of data are seen, respectively, in the Car data set. It is observed that the error rates get smaller as more data are seen for all noise levels and for all default classes, and that the error rates of all four RDRs become higher as the data set becomes more noisy. However, the RDR which has the maximum minimum description length has the best accuracy of the four RDRs for all noise levels. The same results are obtained for the other data sets. We can conclude that even when the data set is noisy, RDR with the default class which has the maximum minimum description length has good accuracy.

Tables 3 and 4 summarize whether the heuristic criterion of maximum minimum description length (MMDL) is actually appropriate for all the fifteen data sets. From Table 3, it is seen that the MMDL class is the best for 14 (including the data sets with Yes*) out of 15 data sets with respect to accuracy, for 13 out of 15 data sets with respect to acquisition speed and for 10 out of 15 data sets with respect to knowledge base size. If we relax the MMDL criterion to be within the top 30%, we see that 12 out of 15 data sets are explained for all aspects (accuracy, speed, size) by this measure. In Table 4, the effect of noise on accuracy is shown for two situations where 20% and 50% of the data are seen. The level of noise added is 10% for both. Similar results are obtained for speed and KB size. We can conclude that the MMDL principle holds even for noisy data.

We have conducted an additional evaluation in which we used the description length as calculated by C4.5 when only the first 10% of the data is seen. The result indicates that the relative values of the description length can be estimated well by using a small fraction of data. This means that we only need to wait until a small amount of data is accumulated before initiating knowledge acquisition using the RDR method.

Table 3. Summary (without noise)

              Accuracy        Speed           KB size
Name          MMDL   <30%     MMDL   <30%     MMDL   <30%     Summary
Car           Yes    -        Yes    -        Yes    -        Yes
Tic-Tac-Toe   Yes    -        Yes    -        Yes    -        Yes
Nursery       Yes*   -        Yes    -        Yes    -        Yes
Connect-4     Yes    -        Yes    -        Yes    -        Yes
Mushroom      Yes*   -        Yes    -        Yes    -        Yes
Monk          Yes    -        Yes*   -        Yes    -        Yes
Splice        Yes*   -        No     No       No     No       No
Ann-thyroid   Yes    -        Yes    -        Yes    -        Yes
Pendigits     Yes*   -        Yes*   -        No     Yes      Yes
Iris          Yes*   -        Yes*   -        No     No       No
Page-block    Yes    -        Yes    -        Yes    -        Yes
Optdigits     Yes    -        Yes*   -        Yes    -        Yes
Yeast         Yes    -        Yes    -        Yes    -        Yes
Waveform      No     No       No     No       No     No       No
Image         Yes    -        Yes    -        No     Yes      Yes
Total         14/15  -        13/15  -        10/15  12/15    12/15

* means that there is very little difference between the best default class and the MMDL default class, and <30% indicates whether the MMDL class ranks within the top 30%.

As RDR is a methodology to acquire useful knowledge from the vague and disordered knowledge of a human expert by forcing him/her to place it in a specific context, it may not be the case that the expert clearly knows the error rate for each class in advance. However, there must be some difference in certainty or confidence in the knowledge which the expert has for each class, and the result in this section shows that it is better to use the most informative (confident) class for the expert as the default class in RDR. This does not necessarily mean using the majority class. The proposed measure is based on a qualitative analysis of the RDR tree and is certainly heuristic, but it appears that it can be a useful measure to characterize the goodness of a default class.



5 Conclusion 

This paper demonstrated experimental results about the effect of the selection of default knowledge and the amount of noise in data on the performance of RDR, which is a knowledge acquisition technique without the need of analysis or intervention of a knowledge engineer. Because of the patching principle in RDR, the knowledge acquired strongly depends on what is given as the default knowledge.

Table 4. Summary (with 10% noise)

              Accuracy (training 20%)   Accuracy (training 50%)
Name          MMDL   <30%               MMDL   <30%               Summary
Car           Yes    -                  Yes    -                  Yes
Tic-Tac-Toe   Yes    -                  Yes    -                  Yes
Nursery       Yes    -                  Yes    -                  Yes
Connect-4     Yes    -                  Yes    -                  Yes
Mushroom      Yes    -                  Yes*   -                  Yes
Monk          Yes*   -                  Yes*   -                  Yes
Splice        Yes    -                  Yes    -                  Yes
Ann-thyroid   Yes    -                  Yes    -                  Yes
Pendigits     No     Yes                No     No                 No
Iris          Yes*   -                  Yes*   -                  Yes
Page-block    Yes    -                  Yes    -                  Yes
Optdigits     No     Yes                Yes*   -                  Yes
Yeast         Yes    -                  Yes    -                  Yes
Waveform      No     No                 No     No                 No
Image         No     No                 No     No                 No
Total         11/15  13/15              12/15  -                  12/15

* means that there is very little difference between the best default class and the MMDL default class, and <30% indicates whether the MMDL class ranks within the top 30%.

It is shown that RDR performs well in terms of its accuracy and size when the default is taken to be the class for which the minimum description length of induction rules and error cases is maximum (heuristic principle of maximum minimum description length). This is the situation where the default class in RDR is taken to be the class for which the number of rules needed to describe it is largest and the error rate of the simulated expert is lowest. This property also holds when the data are noisy.

The experiments also show that the good property of RDR, namely that it can acquire the majority of knowledge at an early stage of knowledge acquisition faster than a standard empirical machine learning approach, also holds for noisy data.

The above characteristics of RDR will be favorable for building a knowledge-based system which assumes a network environment, where multiple users and experts interact with each other through the computer network. Acquiring or updating knowledge from multiple experts in a network environment is a new challenge.



Abstract. This paper investigates boosting naive Bayesian classification. It first shows that boosting cannot improve the accuracy of the naive Bayesian classifier on average in a set of natural domains. By analyzing the reasons for boosting's failures, we propose to introduce tree structures into naive Bayesian classification to improve the performance of boosting when working with naive Bayesian classification. The experimental results show that although introducing tree structures into naive Bayesian classification increases the average error of naive Bayesian classification for individual models, boosting naive Bayesian classifiers with tree structures can achieve significantly lower average error than the naive Bayesian classifier, providing a method of successfully applying the boosting technique to naive Bayesian classification.



1 Introduction 

For many KDD applications, the prediction (classification) accuracy is the pri- 
mary concern. Recent studies on the boosting technique have brought a great 
success for increasing prediction accuracy of classifier learning algorithms [1, 9, 
15, 16]. Boosting induces multiple classifiers in sequential trials by adaptively 
changing the distribution of the training set based on the performance of previ- 
ously created classifiers. At the end of each trial, instance weights are adjusted 
to reflect the importance of each training example for the next induction trial. 
The objective of the adjustment is to increase the weights of misclassified train- 
ing examples. Change of instance weights causes the learner to concentrate on 
different training examples in different trials, thus resulting in different classi- 
fiers. Finally, the individual classifiers are combined through voting to form a 
composite classifier. Bauer and Kohavi [1] report that boosting for decision tree 
learning achieves a relative error reduction of 27% in a set of domains. Although 
boosting occasionally increases the error of decision tree learning in some do- 
mains, all previous studies showed that it can significantly reduce the error of 
decision tree learning in the majority of domains [1, 3, 15]. 

On the other hand, naive Bayesian classifier learning [7] is simple and fast. It has been shown that in many domains the prediction accuracy of the naive Bayesian classifier compares surprisingly well with that of other more complex learning algorithms such as decision tree learning, rule learning, and instance-based learning algorithms [5, 6, 12]. In addition, naive Bayesian classifier learning is robust to noise and irrelevant attributes. Therefore, it is interesting to explore how boosting affects naive Bayesian classification.

In sharp contrast to boosting decision trees, our experiments show that boost- 
ing gets only 1% of relative error reduction over the naive Bayesian classifier on 
average in a set of domains from UCI [2]. The average error rate of the boosted 
naive Bayesian classifier over these domains is even higher than that of the naive 
Bayesian classifier. In quite a number of domains, boosting cannot increase, or 
even decreases, the accuracy of naive Bayesian classifier learning. We are inter- 
ested in investigating the reason(s) why boosting performs poorly with naive 
Bayesian classifiers and the ways to improve its performance. 

One key difference between naive Bayesian classifiers and decision trees, with 
respect to boosting, is the stability of the classifiers — naive Bayesian classifier 
learning is relatively stable with respect to small changes to training data, but 
decision tree learning is not. Combining these two techniques will result in a 
learning method that produces more unstable classifiers than naive Bayesian 
classifiers. We expect it will improve the performance of boosting for naive 
Bayesian classification. 

The hybrid approach generates naive Bayesian trees. Kohavi [10] proposes 
such a hybrid approach, called NBTree. The purpose is to improve the accuracy 
of the naive Bayesian classifier by alleviating the attribute inter-dependence 
problem of naive Bayesian classification. 

Here, we want to explore whether and how the introduction of tree structures 
into naive Bayesian classification affects the performance of boosting for naive 
Bayesian classification. For this purpose, we use levelled naive Bayesian trees, whose tree structures are grown only to a pre-defined maximum depth. The following
section briefly presents the boosting technique including the details in our imple- 
mentation. Section 3 describes the naive Bayesian classifier learning algorithm 
and the levelled naive Bayesian tree learning algorithm that are used as the base 
learners for boosting in this study. Section 4 empirically studies the effects of 
introducing tree structures into naive Bayesian classification on the performance 
of boosting for naive Bayesian classification. We conclude in the final section. 
The complete version of this paper can be found in Ting and Zheng [17]. 

2 Boosting 

The basic idea of boosting has been briefly described at the very beginning of this paper. In our implementation, boosting uses reweighting rather than resampling [9, 15]. The boosting procedure is as follows. The weight adjustment formulas in step (iii) below are from a new version of boosting [16].

Boosting procedure: Given a base learner and a training set $T$ containing $N$ examples, $w_k(n)$ denotes the weight of the nth example at the kth trial, where $w_1(n) = 1/N$ for every n. At each trial $k = 1, \ldots, K$, the following three steps are carried out.

(i) A model $M_k$ is constructed using the base learner from the training set under the weight distribution $w_k$.

(ii) $T$ is classified using the model $M_k$. Let $\delta(n) = 1$ if the nth example in $T$ is classified incorrectly; $\delta(n) = 0$ otherwise. The error rate of this model, $\epsilon_k$, is defined as

$$\epsilon_k = \sum_{n} w_k(n)\,\delta(n). \qquad (1)$$

If $\epsilon_k > 0.5$ or $\epsilon_k = 0$, then $w_k(n)$ is re-initialized using bootstrap sampling and the boosting process continues from step (i).

(iii) The weight vector $w_{k+1}$ for the next trial is created from $w_k$ as follows:

$$w_{k+1}(n) = w_k(n)\,\frac{\exp\left(-\alpha_k (-1)^{\delta(n)}\right)}{z_k}, \qquad (2)$$

where the normalizing term $z_k$ and the vote weight $\alpha_k$ are

$$z_k = 2\sqrt{(1-\epsilon_k)\,\epsilon_k}, \qquad \alpha_k = \frac{1}{2}\ln\frac{1-\epsilon_k}{\epsilon_k}. \qquad (3)$$

After $K$ trials, the models $M_1, \ldots, M_K$ are combined to form a single composite classifier. Given an example, the final classification of the composite classifier relies on the votes of all the individual models. The vote of the model $M_k$ is worth $\alpha_k$ units. The voting scheme simply sums up the votes for the predicted class of every individual model.
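The procedure can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the form $\alpha_k = \frac{1}{2}\ln((1-\epsilon_k)/\epsilon_k)$ is assumed because it is the value for which $z_k = 2\sqrt{(1-\epsilon_k)\epsilon_k}$ normalizes the weights, bootstrap re-initialization is assumed to mean setting weights proportional to bootstrap sample counts, and the `max_restarts` guard and the one-dimensional `stump_learner` base learner are hypothetical additions used only to make the example runnable.

```python
import math
import random
from collections import defaultdict


def bootstrap_weights(n_examples, rng):
    """Weights proportional to how often each example is drawn in a bootstrap."""
    counts = [0] * n_examples
    for _ in range(n_examples):
        counts[rng.randrange(n_examples)] += 1
    return [c / n_examples for c in counts]


def boost(base_learner, X, y, K, rng=None, max_restarts=20):
    """Return a list of (alpha_k, M_k) pairs built by reweighting."""
    rng = rng or random.Random(0)
    N = len(X)
    w = [1.0 / N] * N
    ensemble, restarts = [], 0
    while len(ensemble) < K and restarts <= max_restarts:
        model = base_learner(X, y, w)                               # step (i)
        delta = [int(model(x) != t) for x, t in zip(X, y)]          # step (ii)
        eps = sum(wn * d for wn, d in zip(w, delta))                # Eq. (1)
        if eps > 0.5 or eps == 0:
            w = bootstrap_weights(N, rng)
            restarts += 1
            continue
        restarts = 0
        alpha = 0.5 * math.log((1 - eps) / eps)                     # Eq. (3)
        z = 2 * math.sqrt((1 - eps) * eps)
        w = [wn * math.exp(-alpha * (-1) ** d) / z                  # Eq. (2)
             for wn, d in zip(w, delta)]
        ensemble.append((alpha, model))
    return ensemble


def vote(ensemble, x):
    """Each model votes for its predicted class with weight alpha_k."""
    totals = defaultdict(float)
    for alpha, model in ensemble:
        totals[model(x)] += alpha
    return max(totals, key=totals.get)


def stump_learner(X, y, w):
    """Hypothetical base learner: best weighted threshold on 1-D inputs."""
    best = None
    for t in sorted(set(X)):
        for hi, lo in ((1, 0), (0, 1)):
            err = sum(wn for x, c, wn in zip(X, y, w)
                      if (hi if x > t else lo) != c)
            if best is None or err < best[0]:
                best = (err, t, hi, lo)
    _, t, hi, lo = best
    return lambda x: hi if x > t else lo


ensemble = boost(stump_learner, [0, 1, 2, 3], [0, 1, 0, 1], K=5)
```

In this toy data set no single threshold separates the classes, so the first model has $\epsilon_1 = 0.25$ and therefore $\alpha_1 = \frac{1}{2}\ln 3$.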



3 Naive Bayesian Classifier Learning and Levelled Naive 
Bayesian Tree Learning 

In this section, we describe two base learning algorithms. One is the naive 
Bayesian classifier [7, 11, 12]. We will show that boosting does not significantly 
improve the performance of the naive Bayesian classifier. The other base learn- 
ing algorithm is for generating levelled naive Bayesian trees. The objective of 
this algorithm is to investigate whether and how we can make boosting perform 
better for naive Bayesian classification. 

3.1 Naive Bayesian Classifiers 

The Bayesian approach to classification estimates the (posterior) probability that an instance belongs to a class, given the observed attribute values for the instance. When making a categorical rather than probabilistic classification, the class with the highest estimated posterior probability is selected.

The posterior probability, $P(C_j|V)$, of an instance being class $C_j$, given the observed attribute values $V = \langle v_1, v_2, \ldots, v_n \rangle$, can be computed using the a priori probability of an instance being class $C_j$, $P(C_j)$; the probability of an instance of class $C_j$ having the observed attribute values, $P(V|C_j)$; and the prior probability of observing the attribute values, $P(V)$. In the naive Bayesian approach, the likelihoods of the observed attribute values are assumed to be mutually independent for each class. With this attribute independence assumption, the probability $P(C_j|V)$ can be re-expressed as:

$$P(C_j|V) = \frac{P(C_j)}{P(V)} \prod_{i=1}^{n} P(v_i|C_j) \propto P(C_j) \prod_{i=1}^{n} P(v_i|C_j). \qquad (4)$$

Note that since $P(V)$ is the same for all classes, its value does not affect the final classification and can thus be ignored. In our implementation of the naive Bayesian classifier, probabilities are estimated using frequency counts with an m-estimate correction [4]. Continuous attributes are pre-processed using an entropy-based discretization method [8].
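A minimal naive Bayesian classifier along these lines can be sketched as below. The details are assumptions for illustration only: the m-estimate is taken in its usual form $(n_{cv} + m\,p)/(n_c + m)$ with a uniform prior $p$ over the observed values of each attribute (and over classes for the class prior), $m = 2$ is an arbitrary choice, all attributes are assumed to be already discretized, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit(X, y, m=2.0):
    """Collect the frequency counts needed for m-estimated probabilities."""
    n_attrs = len(X[0])
    values = [set(row[i] for row in X) for i in range(n_attrs)]
    cond = defaultdict(Counter)            # (attribute index, class) -> value counts
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return {"n": len(y), "m": m, "classes": Counter(y),
            "values": values, "cond": cond}


def log_scores(model, row):
    """log P(C_j) + sum_i log P(v_i | C_j) for every class C_j."""
    m, k = model["m"], len(model["classes"])
    scores = {}
    for c, n_c in model["classes"].items():
        score = math.log((n_c + m / k) / (model["n"] + m))
        for i, v in enumerate(row):
            prior = 1.0 / len(model["values"][i])
            score += math.log((model["cond"][(i, c)][v] + m * prior) / (n_c + m))
        scores[c] = score
    return scores


def predict(model, row):
    scores = log_scores(model, row)
    return max(scores, key=scores.get)


X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
nb = fit(X, y)
```

Working in log space avoids underflow when many attributes are multiplied, and the m-estimate keeps every probability strictly positive even for value/class pairs never seen together.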

3.2 Levelled Naive Bayesian Trees 

Our motivation for designing the levelled naive Bayesian tree learning algorithm 
is to introduce instability to naive Bayesian classification. We use the maximum 
depth of a tree, as a parameter of the algorithm, to control the level of tree 
structures incorporated into naive Bayesian classification. 

The Levelled Naive Bayesian Tree learning algorithm, called Lnbt, has the 
same tree growing procedure as in a Top-Down Decision Tree induction algo- 
rithm. Like C4.5 [14], Lnbt also uses the information gain ratio criterion to 
select the best test at each decision node during the growth of a tree. The dif- 
ferences between Lnbt and C4.5 are: (i) Lnbt grows a tree up to a pre-defined 
maximum depth (d); (ii) Lnbt does not perform pruning; (iii) Lnbt generates 
one local naive Bayesian classifier at each leaf using all training examples at 
that leaf and all attributes that do not appear on the path from the root to the 
leaf; and (iv) Lnbt stops splitting a node if it contains fewer than 30 training
examples, to ensure reasonably accurate estimates for local naive Bayesian
classifiers. When the maximum tree depth d is set at 0, Lnbt is identical to a naive
Bayesian classifier generator. Therefore, the naive Bayesian classifier is referred
to as Lnbt(0) later on in this paper.
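The four differences from C4.5 listed above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes nominal attributes with multiway splits, omits C4.5's refinements of the gain ratio criterion, and adds a hypothetical "default" leaf for attribute values unseen during training. All names and the value of m are assumptions.

```python
import math
from collections import Counter

MIN_SPLIT = 30  # nodes with fewer training examples are not split (from the text)

def entropy(labels):
    n = len(labels)
    return -sum(k / n * math.log2(k / n) for k in Counter(labels).values())

def gain_ratio(rows, labels, a):
    groups = {}
    for row, c in zip(rows, labels):
        groups.setdefault(row[a], []).append(c)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return (entropy(labels) - remainder) / split_info if split_info > 0 else 0.0

def make_leaf(rows, labels, attrs):
    # The local naive Bayesian classifier uses only attributes not on the path.
    return {"leaf": True, "rows": rows, "labels": labels, "attrs": list(attrs)}

def grow(rows, labels, attrs, max_depth, depth=0):
    """Grow to at most max_depth levels, without pruning."""
    if depth >= max_depth or len(rows) < MIN_SPLIT or not attrs:
        return make_leaf(rows, labels, attrs)
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) <= 0.0:
        return make_leaf(rows, labels, attrs)
    rest = [a for a in attrs if a != best]
    parts = {}
    for row, c in zip(rows, labels):
        part_rows, part_labels = parts.setdefault(row[best], ([], []))
        part_rows.append(row)
        part_labels.append(c)
    children = {v: grow(r, y, rest, max_depth, depth + 1) for v, (r, y) in parts.items()}
    # Hypothetical fallback for values never seen at this node.
    return {"leaf": False, "attr": best, "children": children,
            "default": make_leaf(rows, labels, rest)}

def classify(node, row, m=2.0):
    """Descend to a leaf, then apply its local naive Bayesian classifier."""
    while not node["leaf"]:
        node = node["children"].get(row[node["attr"]], node["default"])
    rows, labels = node["rows"], node["labels"]
    counts = Counter(labels)
    best_class, best_score = None, -math.inf
    for c, n_c in counts.items():
        score = math.log((n_c + m / len(counts)) / (len(labels) + m))
        for a in node["attrs"]:
            n_values = len({r[a] for r in rows})
            hits = sum(1 for r, y in zip(rows, labels) if y == c and r[a] == row[a])
            score += math.log((hits + m / n_values) / (n_c + m))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [(i % 2, i % 3) for i in range(60)]
labels = [r[0] for r in rows]
tree = grow(rows, labels, [0, 1], max_depth=1)
flat = grow(rows, labels, [0, 1], max_depth=0)  # a plain naive Bayesian classifier
print(tree["attr"], classify(tree, (1, 2)), classify(flat, (0, 1)))
```

With max_depth=0 the root is returned as a single leaf over all attributes, which mirrors the statement that Lnbt(0) is a naive Bayesian classifier.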

4 Experiments 

In this section, we first show empirically that boosting fails to improve the
naive Bayesian classifier. Then, we incorporate tree structures into naive Bayesian
classification, and show that this improves the performance of boosting for naive
Bayesian classification. We refer to the boosted levelled naive Bayesian tree
classifier as BLNBT(d ≥ 1), and to the boosted naive Bayesian classifier as
BLNBT(d = 0). The parameter K controlling the number of models generated in
boosting is set at 100 for all experiments. It is interesting to see the performance
improvement that can be gained by two orders of magnitude increase in computation.

Fourteen domains of moderate to large data sizes from the UCI machine
learning repository [2] are used in the experiments. These data sizes provide
a more reliable assessment of the algorithms than small ones. Table 1 gives a
description of these domains and the testing methods and data sizes used in the
experiments. DNA, Satellite, Letter and Shuttle are four datasets used in the
Statlog project [13].

Table 1. Description of the domains used in the experiments.

Dataset            Training    Testing         # Classes   # Attr
                   data size   method/size
Waveform                 300   10 runs/5000            3   40C
Led24                    500   10 runs/5000           10   24B
Euthyroid               3163   2 x 5cv                 2   18B 7C
Hypothyroid             3163   2 x 5cv                 2   18B 7C
Splice                  3177   2 x 5cv                 3   60N
Abalone                 4177   2 x 5cv                 3   1N 7C
Nettalk (stress)        5438   2 x 5cv                 5   7N
Nursery                15960   2 x 5cv                 5   8C
Coding                 20000   2 x 5cv                 2   15N
Adult                  32561   1 run/16281             2   1B 7N 6C
DNA                     2000   1 run/1186              3   60N
Satellite               4435   1 run/2000              6   36C
Letter                 15000   1 run/5000             26   16C
Shuttle                43500   1 run/14500             7   9C

N: nominal; B: binary; C: continuous

The main concern of this study is the accuracy performance of learning
algorithms. Therefore, we report the error rate of each algorithm under consideration
on the test set in each domain. To compare two algorithms, A vs B for example,
we provide the error rate ratio of A over B, ε_A/ε_B, in each domain, where ε_A is
the test error rate of A and ε_B is the test error rate of B. If multiple runs are
conducted in a domain, the reported value is an average value over the runs.
To compare two algorithms, A vs B, across the 14 domains, the w/t/l record is
reported, which gives the numbers of wins, ties, and losses between the error rates
of A vs B in the 14 domains. A one-tailed pairwise sign-test is applied on this
w/t/l record to examine whether A performs better than B significantly more
often than the reverse at a level better than 0.05. The significance level p will
be provided for the sign-test on each w/t/l record.
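The w/t/l record and the one-tailed sign-test can be sketched as below, using the error ratios published in Table 2. Two points are assumptions of this sketch rather than statements from the text: ties are decided on the two-decimal ratios as printed, and ties are dropped before the binomial tail is computed. The function names are illustrative.

```python
from math import comb

def wtl_record(ratios):
    """Wins, ties, and losses of A vs B given error ratios e_A / e_B."""
    wins = sum(1 for r in ratios if r < 1.0)
    ties = sum(1 for r in ratios if r == 1.0)
    losses = sum(1 for r in ratios if r > 1.0)
    return wins, ties, losses

def sign_test_p(wins, losses):
    """One-tailed sign-test: P(X >= wins) for X ~ Binomial(wins + losses, 0.5)."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Error ratios Blnbt(0)/Lnbt(0) for the 14 domains, as listed in Table 2.
table2_ratios = [1.32, 1.03, 0.77, 0.88, 0.99, 0.94, 1.57,
                 1.00, 0.79, 0.91, 1.29, 0.94, 0.95, 0.50]

record = wtl_record(table2_ratios)
p = sign_test_p(record[0], record[2])
print(record, round(p, 4))  # (9, 1, 4) 0.1334
```

Under these assumptions the sketch reproduces the 9/1/4 record and the p value of .1334 reported in Table 2.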

4.1 Boosting Naive Bayesian Classifiers 

Table 2 shows the error rates of the naive Bayesian classifier and the boosted
naive Bayesian classifier as well as their error ratios. From this table, we can
see that although boosting can improve the accuracy of the naive Bayesian
classifier in 9 out of the 14 domains, it does decrease the accuracy of the naive
Bayesian classifier in 4 other domains. A one-tailed pairwise sign-test fails to
show that boosting reduces the error of the naive Bayesian classifier significantly
more often than the reverse across the 14 domains at a level of 0.05. When
boosting decreases the error of the naive Bayesian classifier, the error
reduction is often small: only in one domain does boosting reduce the error of the
naive Bayesian classifier by more than 25%. The mean relative error reduction
of boosting over the naive Bayesian classifier in the 14 domains is only 1%,
indicating very marginal improvement due to boosting. In terms of the mean error
rate, the boosted naive Bayesian classifier is worse than the naive Bayesian classifier
by 0.23 percentage points across the 14 domains, indicating the poor average
performance of boosting.

Table 2. Error rates (%) of the naive Bayesian classifier and the boosted naive Bayesian
classifier as well as their error ratios.

Dataset         Error rate (%)         Error ratio
                Lnbt(0)   Blnbt(0)     Blnbt(0)/Lnbt(0)
Wave              20.20      26.72       1.32
Led24             27.48      28.39       1.03
Euthyroid          4.14       3.21        .77
Hypothyroid        1.49       1.31        .88
Abalone           41.19      40.73        .99
Nettalk(s)        16.03      15.04        .94
Splice             4.63       7.29       1.57
Coding            28.75      28.72       1.00
Nursery            9.80       7.73        .79
Adult             16.49      15.03        .91
DNA                5.48       7.08       1.29
Satellite         17.70      16.60        .94
Letter            25.26      24.12        .95
Shuttle            0.18       0.09        .50
mean              15.63      15.86        .99
w/t/l                                    9/1/4
p                                        .1334

In short, boosting does not work well with naive Bayesian classification. This
is in sharp contrast to the success of boosting decision trees. There are two
possible explanations for this observation. First, the naive Bayesian classifier
could be optimal in some domains. In such a case, we cannot expect boosting to
further improve its accuracy. For example, all the attributes are independent for
each class in the Led24 domain. Thus, the naive Bayesian classifier is optimal in
this domain. Consequently, a relative increase of 3% error due to boosting in the
Led24 domain is not surprising. The other explanation is that the naive Bayesian
classifier is quite stable with respect to small changes to training data, since these
changes will not result in big changes to the estimated probabilities. Therefore,
boosting naive Bayesian classifiers may not be able to generate multiple models
with sufficient diversity. These models cannot effectively correct each other's
errors during classification by voting. In such a situation, boosting cannot greatly
reduce the error of naive Bayesian classification. The experiments in the following
subsection are designed to provide support to this argument.

Table 3. Error rates (%) of LNBT(d) and error ratios of BLNBT(d) vs LNBT(d) for d =
0, 1, 3, 5, and 10.

                 d = 0          d = 1          d = 3          d = 5          d = 10
Dataset       err(%) ratio   err(%) ratio   err(%) ratio   err(%) ratio   err(%) ratio
Wave           20.20  1.32    24.78   .95    29.34   .73    29.34   .73    29.34   .73
Led24          27.48  1.03    29.28  1.03    31.14  1.11    30.04  1.17    30.04  1.16
Euthyroid       4.14   .77     3.59   .82     2.88  1.18     2.88  1.15     2.88  1.11
Hypothyroid     1.49   .88     1.50   .98     1.56   .96     1.56   .90     1.56   .91
Abalone        41.19   .99    37.31   .99    38.11  1.00    38.34  1.02    38.34  1.02
Nettalk(s)     16.03   .94    13.41  1.11    16.28   .74    16.28   .74    16.28   .74
Splice          4.63  1.57     4.91  1.35     8.06   .61     9.21   .50     9.22   .46
Coding         28.75  1.00    27.92   .97    26.33   .70    26.42   .66    26.42   .66
Nursery         9.80   .79     8.16   .82     5.56   .11     3.55   .05     3.55   .08
Adult          16.49   .91    16.43   .92    15.92   .97    15.78  1.00    16.21   .98
DNA             5.48  1.29     6.24   .95     8.09   .55     8.85   .54     9.70   .44
Satellite      17.70   .94    15.95   .87    17.30   .66    17.60   .64    17.60   .62
Letter         25.26   .95    20.64   .74    21.12   .51    22.80   .39    22.78   .37
Shuttle         0.18   .50     0.09   .69     0.09   .53     0.08   .49     0.08   .49
mean           15.63   .99    15.02   .94    15.84   .74    15.91   .71    16.00   .70
w/t/l, p      (entries illegible in the source)

4.2 Boosting Levelled Naive Bayesian Trees 

Decision tree learning is not stable in the sense that small changes to training 
data may result in different decision trees. We expect that introducing tree struc- 
tures into naive Bayesian classifiers can increase the instability of naive Bayesian 
classification, thus improving the effectiveness of boosting for naive Bayesian 
classification. To investigate this issue, we compare levelled naive Bayesian trees
with their boosted versions while gradually increasing the maximum depth of the trees.
The results are shown in Table 3, which gives the error rates of LNBT(d) and the
error ratios of BLNBT(d) over LNBT(d) for d = 1, 3, 5, and 10. As a baseline
for comparison, the results of Lnbt(0) and Blnbt(0) are also included in the
table. From this table, we have the following observations:

• Comparing BLNBT(d) with LNBT(d), boosting continues to improve the
average accuracy of the levelled naive Bayesian tree as we increase the depth
of the tree. Similarly, the error ratio of BLNBT(d) over LNBT(d) continues
to decrease from .99 to .70 as d increases from 0 to 10. Note that the results
for d = 10 are very close to those for d = 5, since LNBT(d) does not build
trees with a depth more than 5 in many domains.

• In all cases with d ≥ 1, boosting reduces the error of the levelled naive
Bayesian tree significantly more often than the reverse at a level better than 0.05.

• Introducing tree structures into the naive Bayesian classification increases
the error on average. After a small decrease in error resulting from introducing
1-level tree structures into the naive Bayesian classification, introducing more
tree structures consistently increases the error on average: from 15.02% to 16.00%
when d increases from 1 to 10.

Fig. 1. Error ratio and mean error rate in the 14 domains as functions of d.
Panel (a): mean error ratio; panel (b): mean error (%).

The last observation above indicates that the relative accuracy improvement
of boosting for naive Bayesian classification when tree structures are added, as
shown in Table 3, might not be interesting, since the algorithms being compared
against have higher error than the naive Bayesian classifier. To clarify this issue, we
use the naive Bayesian classifier Lnbt(0) as the base classifier for comparison.
We analyze the error ratio of LNBT(d) over Lnbt(0), and the error ratio of
Blnbt(d) over Lnbt(0) for d = 1, 3, 5, and 10, and present them in Figure 1(a).
The mean error rates of LNBT(d) and Blnbt(d) in the 14 domains are depicted
in Figure 1(b). From this analysis, we obtain the following observations:

• As expected from the last observation, the mean error ratio of LNBT(d) over
the naive Bayesian classifier in the 14 domains consistently increases when
d increases from 1, indicating that introducing tree structures into naive
Bayesian classification decreases its accuracy while increasing its instability.

• The mean error ratio of BLNBT(d) over the naive Bayesian classifier in the 14
domains keeps decreasing when d increases from 1 to 10. In addition, we find
that the mean error rate of Blnbt(d) in the 14 domains consistently reduces
from 15.86% to 14.32% to 12.63% to 12.52% to 12.41%, when d increases
from 0 to 10 as shown in Figure 1(b).

• The sign-test shows that BLNBT(d) is more accurate than the naive Bayesian
classifier significantly more often than the reverse across the 14 domains
at a level better than 0.05 for d = 5 and 10. The significance level for d = 1
and 3 is .0898, close to this cut point.

• When d = 10, BLNBT(d) achieves 27% mean relative error reduction over the
naive Bayesian classifier in the 14 domains, a great improvement in
comparison to the 1% reduction when boosting the naive Bayesian classifier. Also,
BLNBT(d)'s mean error rate is 3.22 percentage points lower than that of the
naive Bayesian classifier. In contrast, boosting the naive Bayesian classifier
increases the mean error rate by 0.23 percentage points. In 6 domains, the
error rate of Blnbt(10) is lower than that of Lnbt(0) by 25% or more.

• Compared to Blnbt(0), the error rate of Blnbt(10) is 27% lower on average
over the 14 domains, although the base learning algorithm of Blnbt(10) has
a mean error 4% higher than the base learning algorithm of Blnbt(0). A
sign-test shows that Blnbt(10) has lower error than Blnbt(0) significantly
more often than the reverse at a level .0287 across the 14 domains.

In summary, all the experimental results show that introducing tree struc- 
tures into the naive Bayesian classification can lead to the success of boosting for 
naive Bayesian classification. Introducing tree structures seems to increase the 
instability of naive Bayesian classification. Although the levelled naive Bayesian 
tree learner is less accurate than the naive Bayesian classifier on average, boost- 
ing levelled naive Bayesian trees can produce significantly higher accuracy than 
the naive Bayesian classifier, performing much better than directly boosting the 
naive Bayesian classifier. 

5 Conclusions 

Although boosting achieves great success with decision tree learning, we experi- 
mentally show that boosting does not work well for naive Bayesian classification. 
There are two possible reasons for this: (i) The naive Bayesian classifier is opti- 
mal in some domains; (ii) The naive Bayesian classifier is relatively stable with 
respect to small changes to training data. We cannot expect to further increase 
the accuracy of the naive Bayesian classifier in the first situation. For the second 
situation, we proposed to introduce tree structures into naive Bayesian classifi- 
cation to increase its instability, and expect that this can improve the success of 
boosting for naive Bayesian classification. 

We have conducted a set of experiments on boosting naive Bayesian clas- 
sification with tree structures in 14 natural domains. The experimental results 
show that although introducing tree structures into naive Bayesian classification 
increases the average error of naive Bayesian classification for individual models, 
boosting naive Bayesian classifiers with tree structures can achieve significantly 
lower average error than the naive Bayesian classifier, providing a method of 
successfully applying the boosting technique to naive Bayesian classification. 
Convex Hulls in Concept Induction

Abstract. This paper investigates modelling concepts as a few, large
convex hulls rather than as many, small, axis-orthogonal divisions as is 
done by systems which currently dominate classification learning. It is 
argued that this approach produces classifiers which have less strong hy- 
pothesis language bias and which, because of the fewness of the concepts 
induced, are more understandable. The design of such a system is de- 
scribed and its performance is investigated. Convex hulls are shown to be 
a useful inductive generalisation technique offering rather different biases 
than well-known systems such as C4.5 and CN2. The types of domains 
where convex hulls can be usefully employed are described. 

Keywords: convex hulls, induction, classification learning. 



1 Introduction 

Classification learning has been dominated by the induction of axis-orthogonal 
decision surfaces in the form of rule-based systems and decision trees. While 
the induction of alternate forms of decision surface has received some attention, 
in the context of non-axis orthogonal decision trees, statistical clustering algo- 
rithms, instance based learning and regression techniques [8,9,1,23,11,4,3,20,7], 
this issue has received little attention in the context of decision rules. This pa- 
per is concerned with the construction of convex polytopes in N-space and the 
interpretation of these as rule-like structures. 

Typically, machine learning systems produce many rules per class and, al- 
though each rule may be individually comprehensible, not all will be and holistic 
appreciation of concepts modeled may be impossible due to their fragmentary 
presentation. Each concept, constructed as a large convex polytope, is expected 
to correspond closely to a single underlying concept of the domain. The form of 
these concepts also gives access to work on extracting mathematical descriptions 
via the techniques of computational geometry including diameters of polytopes, 
intersections and equations for the surfaces of polytopes [14].

Concept representation using regions, which are less strongly constrained 
in their shape than hyperrectangles, will reduce the hypothesis language bias, 
which might be expected to be particularly strong in axis orthogonal systems. 
The performance of a prototype system [13] indicates that the technique of using 
geometrically minimal enclosing hyperpolygons to represent concepts is sound 
and has the potential to infer more accurate classifiers than Straight Axis Parallel
(SAP) and Straight Non-Axis Parallel (SNAP) systems in Curved Edge (CE)
domains. The main problem uncovered in the prototype was spiking which is 
the inappropriate association of points from separated regions of the same class 
causing misclassification of points in the intervening region. Insisting on the 
convexity of the polytopes eliminates this possibility. A tight-fitting polytope 
offers generalisations which are of much smaller volume than generalisations 
which are minimal in terms of axis orthogonally biased hypothesis languages. 
This offers a much more conservative induction process and might be expected to 
make fewer errors in making positive predictions. The classification performance 
of a convex hull based system is compared to other well-understood systems. 

When a classifier produces concepts of a different shape from the underlying 
concept, there will always be error in the representation. This is expected from 
the Law of Conservation of Generalisation Performance [17] or the No Free 
Lunch Theorem [22], notwithstanding the caveats of Rao et al. [16], but is still 
a problem as limited data limits the accuracy of the representation. Since the 
geometric approach is considerably less limited in the orientation and placement 
of decision surfaces, one might expect better performance than axis orthogonal 
systems can provide over a range of domains where the underlying decision 
surfaces are not axis orthogonal. 

2 Implementation of CHI 

The perceived problem with SAP biased induction systems is that they will
represent SNAP concepts by a possibly large number of small regions. More
comprehensible concept descriptions will be obtained if single concepts can be
represented by single induced structures, so CHI will construct large aggregates
of instances. Instances of other classes within such large structures will be treated
as exceptions and will be modeled separately subsequently. It is not possible to
use an incremental approach to building convex hulls, such as that employed
in the prototype system, PIGS [13], because of the problems with spiking [12].
In the CHI algorithm, construct-next-rule() constructs a convex hull around all
points of the most frequent class in the misclassified points set and prepends [21]
it to the decision list of hulls. The hull is built with the quickhull software [19],
which is an implementation of the quickhull algorithm [2] using a variant of
Grünbaum's Beneath-Beyond Theorem [6]. Each hull is defined by the intersection
of a set of planes, each of which is defined by the components of an outward-pointing,
unit normal to the plane and its perpendicular distance from the origin. Thus each
facet in N dimensional space is described by N+1 real numbers. Techniques for
rule placement other than simple prepending were investigated [12] but simple
hill-climbing heuristics were found to frequently get trapped in local minima
and so were abandoned. The algorithm handles categorical values by having,
associated with each hull, a set of valid values for each categorical attribute. To
be covered by a rule, the categorical attribute values must be in the set and
the continuous attributes must be covered by the hull. The current approach is
oriented towards ordinal attributes and is not intended for use in domains with
many categorical attributes. The algorithm will construct convex hulls where
possible but will resort to an Axis Orthogonal hull (i.e. hyperrectangle) if
necessary. The algorithm continues until all training points are classified correctly
or the algorithm cannot reduce the number of unclassified points.
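The facet representation and the covering test described above can be sketched as follows. The sign convention (a point is inside when normal · x ≤ distance for every facet), the tolerance, the tuple layout of a rule, and the function names are assumptions made for illustration; they are not taken from CHI or from the quickhull software.

```python
def covers(facets, point, tol=1e-9):
    """True if the point lies on the inner side of every facet.

    Each facet is (unit_normal, distance): N components of an outward-pointing
    unit normal plus the perpendicular distance from the origin, N+1 numbers.
    """
    return all(
        sum(n_i * x_i for n_i, x_i in zip(normal, point)) <= distance + tol
        for normal, distance in facets
    )

def classify(decision_list, continuous, categorical, default=None):
    """Return the class of the first rule (hull) in the list that covers the instance."""
    for label, facets, valid_values in decision_list:
        categorical_ok = all(categorical[a] in allowed
                             for a, allowed in valid_values.items())
        if categorical_ok and covers(facets, continuous):
            return label
    return default

# The unit square [0, 1] x [0, 1] in 2 dimensions: four facets, three numbers each.
unit_square = [((1.0, 0.0), 1.0), ((-1.0, 0.0), 0.0),
               ((0.0, 1.0), 1.0), ((0.0, -1.0), 0.0)]
rules = [("positive", unit_square, {"colour": {"red", "blue"}})]

print(classify(rules, (0.5, 0.5), {"colour": "red"}, default="negative"))
print(classify(rules, (1.5, 0.5), {"colour": "red"}, default="negative"))
```

Because the rules form a decision list, a newly prepended hull takes precedence over earlier ones, which is how exceptions inside a large hull can override it.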

The CHI algorithm is stated as

CHI
    Rule-list = empty
    FOR all classes
        EXTRACT unclassified-points[this_class] from training-set
        prev-num-misclassified-pts[this_class] = ∞
    ENDFOR
    construct-next-rule and update num-misclassified-pts
    WHILE num-misclassified-pts > 0
          AND num-misclassified-pts < prev-num-misclassified-pts[this_class]
        prev-num-misclassified-pts[this_class] = num-misclassified-pts
        misclassified-points = all points in training-set not correctly
                               classified by Rule-list
        construct-next-rule and update num-misclassified-pts
    ENDWHILE
END.

CONSTRUCT-NEXT-RULE
    SET best rule to empty
    COUNT number of each class in misclassified points
    FOR all classes
        IF number in this class is zero
            continue to next class
        ENDIF
        extract-next-rule
        EVALUATE DL with this new rule prepended
        IF this rule is best so far
            SAVE rule to best rule
        ENDIF
    ENDFOR
    INSERT best rule in DL in appropriate position
    CONSTRUCT new misclassified points list
END.

EXTRACT-NEXT-RULE
    FOR all points of this class in misclassified points list
        FOR all attributes
            IF categorical
                SET array entry in rule
            ELSE
                PUT in continuous attribute file
            ENDIF
        ENDFOR
        IF not enough points to construct hull OR forcing AO_HULL
            construct AO_HULL
        ELSE
            construct convex hull
            IF qhull fails due to degenerate geometry
                construct AO_HULL
            ENDIF
        ENDIF
    ENDFOR
END.



For an initial comparison of CHI and C4.5 [15], artificial data sets [12] with
mostly curved edges (2, 3 and 4 dimensional spheres, pairs of concentric spheres,
and one set of mixed shapes) were used so that the density of data points,
the dimensionality of the data set and the complexity of the data set could be
controlled. C4.5 was chosen as a well-known exemplar of an SAP-biased system.
All experiments in this paper use a randomly selected and randomly ordered
80% of the data for training and 20% for testing. The resulting accuracies are
presented in Table 1; the first result in each box is that for C4.5, the second is
that for CHI, and an asterisk (*) shows where there was superiority at p = 0.05
in a matched pairs t-test. For small data sets, C4.5 tends to have better accuracy.
As the size of the data set rises, CHI always becomes significantly better. The point at
which the change takes place becomes higher as the dimensionality increases. It is
possible that CHI's need for large numbers of data points is a consequence of its
less stringent geometric bias. It chooses between a greater number of classifiers
because it has more options in terms of position and orientation of each decision
surface and, thus, it needs more data to make good choices. If it has sufficient
data then its greater flexibility should enable it to make better classifiers.

Data size   1000           2000           4000           8000           15000
Circle      96.28/96.55    97.46/97.25    97.73/98.20*   98.54/98.86*   98.89/99.29*
Sphere      90.66/90.69    92.84/93.36*   94.57/95.19*   95.31/96.57*   95.97/97.47*
Hyp-sphr    *90.42/89.63   *92.41/90.70   *94.04/93.50   *94.87/94.10   95.81/95.83*
CnCrcl      92.16/92.89*   93.97/94.52*   95.57/96.83*   96.84/98.80*   97.65/98.7*
CnSphr      *88.95/88.32   85.01/86.29*   87.89/90.50*   90.54/92.92*   92.13/94.71*
CnHySp      *81.36/78.12   *84.21/81.92   *87.21/86.24   *89.88/89.49   90.78/91.62*
RCC         *95.06/93.80   *96.75/95.97   97.45/97.62*   98.10/98.42*   98.74/99.01*

Table 1. C4.5 and CHI Accuracy on Artificial Data Sets

2.1 Time Complexity of CHI 

If a learning domain contains n points with c classes in a space of dimension d,
it is expected from design criteria that there will be approximately c groups of
unclassified points to be processed and the expected number of facets per hull is
$\log^{d-1} n$ [14]. Since a linear search is used for the c hulls, the time complexity for
classifying n points is $O(n\,c \log^{d-1} n)$. The time complexity of the hull creating
algorithm can be shown [12] to be

$$O\left(c^{[?]} \cdot \log^{d-1} n \cdot n^{\lfloor (d+1)/2 \rfloor}\right)$$

where [?] marks the exponent of c, which is illegible in the source.

3 Modification of Convex Hulls

A convex hull based system will tend to construct enclosures which are a subset 
of the underlying concept. It can be shown [12] that a hull can be simply inflated 
like a balloon and that this will produce a statistically significant improvement 
in its classification accuracy. Another technique for generalising convex hulls 
is to remove facets which do not contribute to its resubstitution classification 
accuracy. Two techniques for this are explored in [12]. Typically, hulls are pruned 
from around 100 facets to between 1 and 5 facets. 

4 Evaluation of CHI 

4.1 Evaluation on Selected Domains 

This learning system will initially be evaluated on domains for which there is
reasonable expectation that the decision surfaces are either curved or are straight
but not axis orthogonal. The performance of CHI will be compared with that of
C4.5, CN2 and OC1. The former two are presented as exemplars of SAP learning
systems and the latter is included as an example of SNAP learning.



Body Fat The BMI data set [23] has attributes of height and weight of persons
and they are classified according to their Body Mass Index (BMI) [18], which is
the weight (kg) divided by the square of the height (metres). In this experiment,
the height and the weight will be used as the attributes and the decision surfaces
are known, from a consideration of the fact that BMI ∝ weight/height², to be
distinctly curved. Thus, in this domain, it is expected that C4.5, CN2 [5] and OC1 [11],
with their preference for long straight lines, will be poorly biased whereas CHI will
be neutrally biased. The data set, derived from frequency tables for height and
weight in [18], can be viewed as a realistic simulation of a real-world classification
problem.
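A short calculation illustrates why the decision surfaces in this domain are curved. For a fixed BMI threshold t, the class boundary in the (height, weight) plane is weight = t × height², a parabola. The threshold of 25 and the sample heights below are illustrative assumptions, not values from the data set.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight (kg) divided by the square of the height (m)."""
    return weight_kg / height_m ** 2

def boundary_weight(threshold, height_m):
    """Weight at which a person of the given height has exactly the threshold BMI."""
    return threshold * height_m ** 2

THRESHOLD = 25.0  # hypothetical class boundary, for illustration only

low, mid, high = (boundary_weight(THRESHOLD, h) for h in (1.5, 1.7, 1.9))
chord_mid = (low + high) / 2.0  # where a straight boundary would pass at 1.7 m

print(round(low, 2), round(mid, 2), round(high, 2), round(chord_mid, 2))
# A person 1.7 m tall weighing 73 kg lies below the straight chord but above
# the true curved boundary, so a single straight surface would misclassify them.
print(bmi(73.0, 1.7) > THRESHOLD, 73.0 < chord_mid)
```

The gap between the chord and the curve is the region a single straight surface gets wrong; axis-orthogonal systems must approximate the curve with many small steps instead.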

This experiment was carried out, with a range of data set sizes, 20 times
as this was sufficient to obtain statistically significant results. The results are
plotted in Figure 1. The win-loss ratio for CHI against CN2 is 20:0 and this is
significant at p = 0.01 using a sign test. Similarly, the win-loss ratio for CHI
versus C4.5 is also 20:0 and is significant at the same level. Comparing CHI with
OC1, the win-loss ratio is 12:3 which is significant at p = 0.05 using a sign test.
On all data sets of size less than 1500, CHI provides a better model of the data.
With larger amounts of data, the performance of OC1 is slightly superior. A
possible explanation is that with low data densities, OC1 generates unsuitably
large decision surfaces which do not match the underlying concepts closely but
which are not invalidated in the training set. At high data densities, OC1 is
constrained from constructing overly large decision surfaces and its performance
becomes very close to that of CHI since its biases happen to suit the class
distributions in this domain marginally better than those of CHI.

Fig. 1. Learning Curves for Body Fat



POL Parallel oblique lines (POL) is an artificial data set which suits the
inductive biases of OC1 [11]. Since the decision surfaces are known to be straight
lines at 45 degrees, the strong SAP bias of C4.5 and CN2 should reduce their
performance but the performance of CHI should be superior since it can provide
decision surfaces close to the correct orientation. OC1 uses all of the neighbouring
points to orient a single large surface where CHI has to place and orient,
possibly, several surfaces from the same amount of information. Therefore, OC1
should provide the best performance of the methods being compared. Twenty
repetitions of comparisons with matched data sets of various sizes were sufficient
to obtain statistically significant results. The set for 500 points is typical and is
shown in Table 2. The mean accuracy for CHI is superior to that of both CN2
and C4.5 at p = 0.01 using a matched pairs t-test. The mean accuracy of OC1
is similarly superior to that of CHI at p = 0.01. Comparing CHI and CN2 using
a sign test, the win-loss ratio is 18:2 in favour of CHI which is significant at
p = 0.01. Comparing CHI and C4.5 using a sign test, the win-loss ratio is 17:3
in favour of CHI which is significant at p = 0.01. Lastly, comparing CHI with
OC1, the win-loss ratio is 1:19 in favour of OC1 and is significant at p = 0.01.

            CHI    CN2    C4.5   OC1
Mean        97.3   95.1   95.2   99.2
Std.Dev.     1.4    1.1    1.1    0.4

Table 2. Accuracy on POL Data Set

These results clearly show that CHI outperforms the SAP systems on a
distinctly SNAP domain. However, the bias of OC1 for straight SNAP decision
surfaces exactly suits this domain and OC1 is far superior to CHI, CN2 and C4.5.

These two experiments suggest that CHI will provide significantly superior
performance to axis orthogonally biased classifier systems on strongly SNAP or
curved concepts. Particularly, on curved concepts, it provides superior
performance to SAP systems and better performance than OC1 at low and medium
data densities. At high data densities, the performance of OC1 may overtake
that of CHI.

4.2 Evaluation on a Variety of Domains 

The evaluation was extended to include a cross-section of well-known, mainly 
continuous, domains from the UCI Repository [10]. An experiment, using 
matched pairs comparisons, was run 100 times on each domain using CHI, C4.5 
and CN2 and the results are shown in Table 3. Because the quickhull software 
has infeasible run-times on some datasets, only randomly selected subsets of the 
available data were used. Sizes of data sets used are shown in the result table. 

Comparing CHI with C4.5, a win-loss ratio of 3:22 is significant at p = 0.01. 
Similarly comparing CHI and CN2, a win-loss ratio of 4:21 is also significant 
at p = 0.01. There is no significant difference between C4.5 and CN2 with a 
win-loss ratio of 14:10. Worse is the fact that when CHI is superior, it is never 
markedly superior but when it is worse, it can be considerably worse. 

This is a disappointing result: CHI is superior only on balance-scale,
echocardiogram and ionosphere. This raises the possibility that, since most UCI
Repository data sets may have been collected and defined in the context of
SAP-based systems, the Repository may have a preponderance of data sets suited
to such systems. If our understanding of how the performance of various systems
depends on their language bias and the shapes of the underlying concepts is
correct, it would be expected that OC1 would have worse predictive accuracy
than CHI on the three data sets identified above.

This was examined experimentally with 20 runs of the comparison experiment.
The results are shown in Table 4.

                         Accuracy                 No.        No. CHI   No. C4.5
Domain (Size)            CHI     C4.5    CN2      Concepts   Hulls     Regions
balance-scale (625)      83.26   76.67   81.30    3          5.6       108.2
B.canc. Wisc. (699)      99.06   99.42   97.75    2          2.8       33.2
bupa (345)               58.77   63.49   64.42    2          3.7       95.8
cleveland (30)           55.54   87.73   87.64    2          3.7       66.8
echocardiogram (62)              69.11   67.90    2          2.6       13.5
german (135)             50.71   60.66   57.13    2          7.3       76.4
glass (214)              64.14   73.31   76.28    3          5.9       43.5
glass7 (214)             55.21   66.18   66.34    7          13.0      49.4
heart (34)               67.71   80.29   81.53    2          3.7       11.4
hepatitis (35)           40.51   41.57   41.27    2          2.0       4.4
horse-colic (40)         62.01   77.03   87.78    2          2.3       9.6
hungarian (70)           62.48   88.55   86.07    2          3.6       55.2
ionosphere (351)                 87.71   89.69    2          5.0       29.4
iris (150)               69.85   94.74   94.13    3          3         8.6
new-thyroid (215)        75.78   91.83   95.30    3          3.0       14.5
page-blocks (140)        57.47   87.20   90.34    5          7.6       14.8
pid (150)                80.30   82.34   82.95    2          5.1       39.0
satimage (225)           72.36   91.33   90.16    6          5.8       19.4
segment (1200)           91.94   94.41   94.32    7          22.0      63.0
shuttle (100)            66.32   93.84   92.05    7          4.8       9.0
sonar (208)              56.64   74.12   70.50    2          2.5       31.6
soybean (200)            81.01   90.68   84.60    19         19.7      70.2
vehicle (30)             45.03   80.61   79.36    4          4.5       20.2
waveform (30)            35.80   72.76   65.70    3          3.3       17.2
wine (35)                84.04   87.01   85.45    3          3.0       39.5

Table 3. Comparison of CHI, C4.5 and CN2

The results show that CHI is significantly superior to OC1 for the
echocardiogram and ionosphere data sets, using a matched pairs t-test at
p = 0.05. However, OC1 is similarly superior to CHI on the balance-scale data
set, which was unexpected. Closer examination of the data sets shows that the
attribute values are essentially continuous for echocardiogram and ionosphere
but that those of balance-scale are entirely integer values. These will make
the decision surfaces consist of flat surfaces between integer values and thus
OC1, which favours flat surfaces, will be suitably biased and the result is
understandable. The continuous attributes of the other domains suggest that
flat surfaces are an unsuitable bias and that OC1 will perform poorly, as found.

The interaction between the underlying concept geometry and the induced 
structures can be explained thus. If the data sets are SAP biased, then SAP 
classifiers have the correct orientation of the decision surface automatically and 
only have to position the surface from the evidence. A SNAP system has to 
decide both the orientation and the position of the decision surface from the 
same amount of evidence so it has a bigger space of theories to explore with the 
same evidence. 

                   Accuracy
                   CHI      OC1
balance-scale      78.7     91.5
echocardiogram     73.0     65.5
ionosphere         91.2     76.8

Table 4. Accuracy of CHI and OC1 on Selected Domains



5 Complexity of Domain Representations 



One advantage advanced for the use of convex hulls is that, from a human 
comprehension viewpoint, the number of structures induced is similar to the 
number of underlying concepts. For CHI, the number of structures is the number 
of convex hulls which are constructed and for C4.5 the number of structures 
is the number of hyperrectangular regions which are identified. The number of 
actual concepts in each domain and the number of concepts induced by CHI and 
C4.5, averaged over 100 runs, are shown in columns 4-6 of Table 3. The average 
number of concepts for satimage and shuttle are lower than one might expect 
because the sample used contains effectively only 5 classes for satimage rather 
than 6 and, for shuttle, contains about 3 classes rather than 7. The smallness of 
the number of concepts induced by CHI can be seen in comparison with C4.5. 
The number of hulls induced by CHI is always very close to the number of actual 
concepts in each domain. Using a sign test, the number of hulls produced by CHI 
is superior to C4.5 at p=0.01. 



6 Conclusions 



Convex hulls have been shown to be a simple, satisfactory method for
constructing useful polytopes offering smaller generalisations than techniques
which use hypothesis language based generalisation. The use of facet deletion
and inflation has been noted as improving the performance of the classifier.

The performance of differing systems has been shown to depend on the
orientation of the surfaces which can be induced and those of the actual
underlying concept. If the SAP classifier has the same orientation for its
decision surfaces as the underlying concepts of the domain then it will always
outperform CHI because it is well biased.

It has been demonstrated that the advantages offered by the use of convex
hulls, rather than SAP-based classifiers, are:

- a classifier with less bias in terms of the geometry of induced concepts.

- a classifier which induces one large structure per concept rather than many
  small structures, which is philosophically appealing and affords an economy
  of representation which SAP systems cannot provide.

- CHI, with fewer, larger hulls, offers classification accuracy which is
  superior to well known systems on data sets where the underlying concepts
  are known to be curved or SNAP. On data sets with SAP underlying concepts,
  CHI needs higher data densities to be competitive.

- convex hulls offer access to mathematical descriptions of modeled concepts.

The disadvantages of CHI are that after the effort to create the convex hull, it 
is immediately deconstructed using facet deletion and inflation and that many 
commonly used data sets may have underlying SAP concepts. CHI needs more 
data than an SAP system since it has to position and orient line segments rather 
than just place them as an SAP system does. On all domains examined with 
known curved or SNAP concepts, CHI outperforms both C4.5 and CN2. 


Abstract. Data classification is an important research topic in the field
of data mining and knowledge discovery. Many data classification methods
have been studied, including decision-tree methods, statistical methods,
neural networks, rough sets, etc. In this paper, we present a new
mathematical representation of qualitative concepts: Cloud Models. With
the new models, mapping between quantities and qualities becomes much
easier and interchangeable. Based on the cloud models, a novel qualitative
strategy for data classification in large relational databases is
proposed. Then, the algorithms for classification are developed, such as
cloud generation, complexity reduction, identifying interacting
attributes, etc. Finally, we perform experiments on a challenging medical
diagnosis domain, acute abdominal pain. The results show the advantages of
the model in the process of knowledge discovery.



1 Introduction 

Data classification is an important research topic in the field of data mining and
knowledge discovery. It finds the common properties among a set of objects in a
database and classifies them into different classes. Many data classification
methods have been studied, including decision-tree methods, statistical methods,
neural networks, rough sets, etc.

In machine learning studies, a decision-tree classification method, developed by
Quinlan [1], has been influential. A typical decision tree learning system is ID-3 [1],
which adopts a top-down irrevocable strategy that searches only part of the search
space. It guarantees that a simple, but not necessarily the simplest, tree is found. An
extension to ID-3, C4.5 [2], extends the domain of classification from categorical
attributes to numerical ones. In [3], Shan et al. proposed a novel approach, which
uses rough sets to ensure the completeness of the classification and the reliability of
the probability estimate prior to rule induction. There are many other approaches to
data classification, such as [4, 5, 6].

In this paper, we present a new mathematical representation of qualitative
concepts: Cloud Models. With the new models, a novel approach for data
classification in large relational databases is proposed.



2 The Qualitative Strategy 

The laboriousness of developing realistic DMKD applications, previously
reported examples of knowledge discovery in the literature, and our experience in
real-world knowledge discovery situations all lead us to believe that knowledge
discovery is a representation-sensitive, human-oriented task consisting of friendly
interactions between a human and a discovery system. Current work in DMKD uses
some form of (extended or modified) SQL as the data mining query language and some
variant of predicate calculus for the discovered results. The variants frequently
contain some form of quantitative modifier, such as confidence, support, threshold,
and so forth. This tends to lead to discovered rules such as:

With 37.8% of support and 15.7% of confidence, patients whose age is
between 20 and 30 and who have had acute pain on the low-right side of the
abdomen for more than 6.24 hours can be classified as appendicitis.

rather than the qualitative representation:

Generally speaking, young patients who have acute low-right abdominal pain
for a relatively long time may get appendicitis.

This is, however, more than a simple issue of semantics and friendliness. The
former rules are not robust under change to the underlying database, while the latter
ones are. In a real, very large database, data are often infected with errors due to the
nature of collection. In addition, an on-line discovery system supporting a real
database must keep up with changing data. It does not make much sense to make a
very precise quantitative assertion about the behavior of an application domain
based on such a database. It may be necessary, to some extent, to abandon the high
standards of rigor and precision used in conventional quantitative techniques. In
such a setting, a piece of extracted qualitative knowledge may be more tolerant and
robust. Clearly, quantitative results such as confidence and support cannot remain
constant when the data change. By contrast, a qualitative representation will remain
true until there is a substantial change in the database.

On the other hand, quantitative knowledge discovered at some lower levels of
generalization in DMKD may still be suitable, but the number of extracted rules
increases. In contrast, as the generalization goes up to a higher level of abstraction,
the discovered knowledge is more strategic. The ability to discover quantitative and
yet significant knowledge about the behavior of an application domain from a very
large database diminishes until a threshold may be reached beyond which precision
and significance (or relevance) become almost mutually exclusive characteristics.
That is to say, if the generalization of a very large database system exceeds a
limit, the reality and exactness of its description become incompatible. This has been
described as the principle of incompatibility. To describe phenomena qualitatively,
we take linguistic variables and linguistic terms and show how quantitative and
qualitative inference complement and interact with each other in a simple mechanism.

Our point of departure in this paper is to represent linguistic terms in logical
sentences or rules. We imagine a linguistic variable that is semantically associated
with a list of all the linguistic terms within a universe of discourse. For example,
“age” is a linguistic variable if its values are “young”, “middle-age”, “old”, “very
old” and so forth, rather than the real ages, which are considered as the universe of
discourse of the linguistic variable “age”, say from 0 to 100. In the more general case,
a linguistic variable is a tri-tuple {X, T(x), Cx(u)} in which X is the name of the
variable; T(x) is the term-set of X, that is, the collection of its linguistic values; U is a
universe of discourse; and Cx(u) is a compatibility function showing the relationship
between a term x in T(x) and U. More precisely, the compatibility function maps the
universe of discourse into the interval [0,1] for each u in U.

It is important to understand the notion of compatibility functions. Consider a set
of linguistic terms, T, in a universe of discourse, U, for example the linguistic
term “young” in the interval [0,100]. T is characterized by its compatibility function
Cx : U → [0,1]. The statement that the compatibility of, say, “28 years old” with
“young” is about 0.7 has a relationship both to fuzzy logic and to probability.

In relation to fuzzy logic, the correct interpretation of the compatibility value “0.7”
is that it indicates the partial membership with which the element “age-value 28”
belongs to the fuzzy concept of the label “young”. In relation to probability, on the
other hand, the correct interpretation of the compatibility value “0.7” is that it is
merely a subjective indication. Human knowledge does not conform to such a fixed
crisp membership degree “0.7” at “28 years old”. There do not exist any unique
partial membership values that could be universally accepted by human beings over
the universe of discourse U. Instead, there is a random variable showing that the
membership degree at “28 years old” takes a random value, behind which a
subjective probability distribution is obeyed. The degree of compatibility itself takes
on a random value. This type of randomness is attached to the fuzziness.



3 Qualitative Representation Based on Cloud Models 
3.1 Cloud Models 

Following the important characteristics of linguistic variables and terms, we define a
new concept of cloud models to represent linguistic terms. Let U = {u} be the
universe of discourse, and T a linguistic term associated with U. The
membership degree of u in U to the linguistic term T, Cx(u), is a random variable
with a probability distribution. Cx(u) takes values in [0,1]. A cloud is a mapping from
the universe of discourse U to the unit interval [0,1].
The bell-shaped clouds, called normal clouds, are the most fundamental and useful in
representing linguistic terms; see Fig. 1. A normal cloud is described with only three
digital characteristics: expected value (Ex), entropy (En) and hyper entropy (He) [7].




Fig. 1. Normal Cloud with digital characteristic Ex=20 En=0.7 He=0.025 

The expected value Ex of a cloud is the position in the universe of discourse
corresponding to the center of gravity of the cloud. In other words, the element Ex in
the universe of discourse fully belongs to the linguistic term represented by the cloud
model. The entropy, En, is a measure of the fuzziness of the concept over the universe
of discourse, showing how many elements in the universe of discourse could be
accepted by the linguistic term. It should be noticed that the entropy defined here is a
generic notion, and it need not be probabilistic. The hyper entropy, He, is a measure
of the uncertainty of the entropy En. Close to the waist of the cloud, corresponding to
the center of gravity, cloud drops are most dispersed, while at the top and bottom the
focusing is much better. The degree of dispersion of the cloud drops depends on
He [6,7].

3.2 Cloud Generators 

Given three digital characteristics Ex, En, and He to represent a linguistic term, a set
of cloud drops may be generated by the following algorithm [8]:

Algorithm 1: Normal Cloud Generation

Input: the expected value of the cloud Ex, the entropy of the cloud En,
the hyper entropy of the cloud He, the number of drops N
Output: a normal cloud with digital characteristics Ex, En, and He

1) Produce a random value x from the normal distribution with
mean = Ex and standard deviation = En;

2) Produce a random value En' from the normal distribution with
mean = En and standard deviation = He;

3) Calculate $y = \exp\left( -\frac{(x - Ex)^2}{2 (En')^2} \right)$;

4) Let (x, y) be a cloud drop in the universe of discourse;

5) Repeat 1-4 until the required number of drops has been generated.
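
The steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name normal_cloud and the seed parameter are assumptions, and step 3 is taken to be the bell-shaped form y = exp(-(x - Ex)^2 / (2 En'^2)), which is an assumption because the formula is damaged in the source text.

```python
import math
import random


def normal_cloud(ex: float, en: float, he: float, n: int, seed=None):
    """Generate n cloud drops (x, y) for a normal cloud with characteristics (Ex, En, He)."""
    rng = random.Random(seed)
    drops = []
    while len(drops) < n:
        x = rng.gauss(ex, en)          # step 1: x drawn with mean Ex, std. dev. En
        en_prime = rng.gauss(en, he)   # step 2: En' drawn with mean En, std. dev. He
        if en_prime == 0.0:            # guard against division by zero in step 3
            continue
        y = math.exp(-((x - ex) ** 2) / (2.0 * en_prime ** 2))  # step 3 (assumed form)
        drops.append((x, y))           # step 4: (x, y) is one cloud drop
    return drops                       # step 5: stop once n drops exist


# Characteristics taken from the caption of Fig. 1.
drops = normal_cloud(ex=20.0, en=0.7, he=0.025, n=1000, seed=0)
print(len(drops), min(y for _, y in drops), max(y for _, y in drops))
```

Each drop pairs a value x with a membership degree y in [0, 1]; because En' is itself random, repeated drops at nearly the same x receive slightly different degrees, which is the randomness of membership discussed in Section 2.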

The idea of using only three digital characteristics to generate a cloud is creative.
The generator can produce as many drops of the cloud as required (Fig. 1). This
kind of generator is called a forward cloud generator. All the drops obey the
properties described above. Cloud drops may also be generated upon conditions. It is
easy to set up a half-up or half-down normal cloud generator with a similar strategy,
if there is a need to represent such a linguistic term.

It is natural to think about the generator mechanism in an inverse way. Given a
number of drops, as samples of a normal cloud, the three digital characteristics Ex,
En, and He can be obtained to represent the corresponding linguistic term. This kind
of cloud generator may be called a backward cloud generator. Since the cloud model
represents linguistic terms, the forward and backward cloud generators can be used
interchangeably to bridge the gap between quantitative and qualitative knowledge.
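
The paper does not spell out a backward generator, so the following is only one possible sketch. It assumes that every drop carries both x and its membership degree y, and that y was produced as y = exp(-(x - Ex)^2 / (2 En'^2)); under those assumptions each drop can be inverted to recover its own En', and En and He follow as the mean and standard deviation of those values. The function name backward_cloud and the cut-off y_max are hypothetical.

```python
import math
import random
import statistics


def backward_cloud(drops, y_max=0.9):
    """Estimate (Ex, En, He) from (x, y) drops by inverting the assumed membership formula."""
    ex = statistics.fmean(x for x, _ in drops)
    en_primes = [
        abs(x - ex) / math.sqrt(-2.0 * math.log(y))
        for x, y in drops
        if 0.0 < y < y_max  # skip drops near the peak, where the inversion is unstable
    ]
    en = statistics.fmean(en_primes)   # En estimated as the mean of the recovered En'
    he = statistics.stdev(en_primes)   # He estimated as their standard deviation
    return ex, en, he


# Synthetic drops produced as in Algorithm 1, with Ex=20, En=0.7, He=0.025.
rng = random.Random(1)
drops = []
for _ in range(5000):
    x = rng.gauss(20.0, 0.7)
    en_p = rng.gauss(0.7, 0.025)
    drops.append((x, math.exp(-((x - 20.0) ** 2) / (2.0 * en_p ** 2))))

ex, en, he = backward_cloud(drops)
print(round(ex, 2), round(en, 2), round(he, 3))
```

The estimate of He is the least reliable of the three, since it also absorbs the error made in estimating Ex; the cut-off y_max limits that effect by discarding drops close to the expected value.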

4 Classification with Cloud Models 

4.1 Concept Generalization Based on Synthesized Cloud 

Suppose we have two linguistic atoms, A1(Ex1, En1, He1) and A2(Ex2, En2, He2), over
the same universe of discourse U. A virtual linguistic atom, A3(Ex3, En3, He3), may be
created by synthesizing the two using the following definition:

$$Ex_3 = \frac{Ex_1 En_1 + Ex_2 En_2}{En_1 + En_2} \qquad (1)$$

$$En_3 = En_1 + En_2 \qquad (2)$$

$$He_3 = \frac{He_1 En_1 + He_2 En_2}{En_1 + En_2} \qquad (3)$$

This kind of virtual cloud is called a Synthesized Cloud.

From the geometrical point of view, this definition satisfies the property that the
torques of A3 (which are the products of the center of gravity of A3 and the
perpendicular distances to the horizontal and vertical axes respectively) are equal to
the sum of the torques of A1 and A2. The linguistic atom A3 can be considered as the
generalization of A1 and A2.




Fig. 2. Generalizing "about 14" and "about 18" to "teenager" 

To make this concrete, Fig. 2 shows how the concepts "about 14" and "about 18"
may be generalized to that of "teenager". This definition can be conveniently extended
to the synthesis of many linguistic atoms. The price paid is that the entropy of the
generalized linguistic atom increases, which reflects its more general information
coverage.
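
A small numerical sketch may help here. It assumes that the synthesized cloud takes the entropy-weighted mean of the two expected values and of the two hyper entropies, and the sum of the two entropies, which is the reading consistent with the torque property described above. The digital characteristics chosen for "about 14" and "about 18" are hypothetical and are not taken from Fig. 2.

```python
from typing import NamedTuple


class Cloud(NamedTuple):
    ex: float  # expected value Ex
    en: float  # entropy En
    he: float  # hyper entropy He


def synthesize(a: Cloud, b: Cloud) -> Cloud:
    """Combine two linguistic atoms into one more general atom (assumed definition)."""
    total_en = a.en + b.en
    return Cloud(
        ex=(a.ex * a.en + b.ex * b.en) / total_en,  # entropy-weighted expected value
        en=total_en,                                # entropies add up
        he=(a.he * a.en + b.he * b.en) / total_en,  # entropy-weighted hyper entropy
    )


about_14 = Cloud(ex=14.0, en=1.0, he=0.05)  # hypothetical characteristics
about_18 = Cloud(ex=18.0, en=1.0, he=0.05)  # hypothetical characteristics
teenager = synthesize(about_14, about_18)
print(teenager)
```

With equal entropies the synthesized expected value lies midway between 14 and 18, and the entropy doubles, which illustrates the wider coverage of the more general term.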

4.2 Reduction of Classification Complexity and Softening Thresholds

In KDD-related problems, the universe U is finite and it is highly desirable for it to be
small. Only finite classifications are “learnable,” i.e., we can potentially acquire
complete knowledge about such classifications. Unfortunately, most finite
classifications are not learnable due to the excessively large number of possible
equivalence classes. Only a small fraction of all possible classifications expressible in
terms of the indiscernibility relation are learnable.

To evaluate the computational tractability of the finite classification learning
problem, we adopt the notion of classification complexity proposed by Ning Shan
in [3], defined as the number of equivalence classes in the classification. In
practice, this number is usually not known in advance. Instead, a crude upper bound
on the classification complexity for a subset of attributes B ⊆ C can be computed “a
priori” by the following formula:

$$TC(B, V) = \prod_{p \in B} \mathrm{card}(V_p) \qquad (4)$$

The quantity TC(B,V) is called the theoretical complexity of the set of attributes B
given the set of values V of the attributes B. If the number of attributes and the size of
the domain Vp for each attribute are large, then TC(B,V) grows exponentially. It
is very difficult to find a credible classification based on a large number of attributes
unless the attributes are strongly dependent (e.g., functionally dependent) on each
other (limiting the number of equivalence classes).
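
As a small worked example of the theoretical complexity, the sketch below multiplies the domain sizes of a few attributes. The attribute names and value sets are invented for illustration and do not come from the paper's data.

```python
from math import prod
from typing import Collection, Mapping


def theoretical_complexity(domains: Mapping[str, Collection[str]]) -> int:
    """Upper bound on the number of equivalence classes: product of the domain sizes."""
    return prod(len(values) for values in domains.values())


# Hypothetical attribute domains V_p for three condition attributes.
domains = {
    "age": {"young", "middle-age", "old"},
    "pain_site": {"low-right", "low-left", "upper"},
    "duration": {"short", "long"},
}
print(theoretical_complexity(domains))  # 3 * 3 * 2 = 18
```

Adding one more attribute with ten values would multiply the bound by ten, which is why the bound grows so quickly with the number of attributes.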

Complexity reduction increases the credibility of the classification by generalizing
condition attributes. The information generalization procedure applies
attribute-oriented concept tree ascension [9,10] to reduce the complexity of an
information system. It generalizes a condition attribute to a certain level based on the
attribute’s concept tree, which is provided by knowledge engineers or domain experts.
Trivially, the values for any attribute can be represented as a one-level concept tree
where the root is the most general value “ANY” and the leaves are the distinct values
of the attribute. The medium level nodes in a concept tree with more than two levels
are qualitative terms, which are expressed in cloud models (see Fig. 3). The data
corresponding to a higher level node must cover the data corresponding to all of its
descendant nodes. The transformations between qualitative terms and quantitative
values of condition attributes are implemented through cloud models.




Fig. 3. Concept Tree with Qualitative Terms 
We modified the algorithm proposed by Shan et al. in [3], which extracts a
generalized information system. In this algorithm, there are two important concepts
that constrain the generalization process: the attribute threshold and the theoretical
complexity threshold. Since the exact values of the thresholds are very hard to
determine, we apply linguistic terms to soften them in our modified algorithm.
The two linguistic terms are represented with cloud models. Since the thresholds are
not exact values, we call them soft thresholds. The entropy of the soft thresholds can
effectively control the number of cycles. This is a novel contribution of this paper.

Condition attributes are generalized by ascending their concept trees until the
number of values for each attribute is less than or equal to the user-specified soft
attribute threshold for that attribute and the theoretical complexity of all generalized
attributes is less than or equal to the user-specified soft theoretical complexity
threshold. In each iteration, one attribute is selected for generalization (this selection
can be made in many ways). Lower level concepts of this attribute are replaced by the
concepts of the next higher level. The number of possible values at a higher level of
an attribute is always smaller than at a lower level, so the theoretical complexity is
reduced.

Algorithm 2: Reduction of Classification Complexity

Input: (1) the original system S with a set of condition attributes C_i (1 ≤ i ≤ n);

(2) a set H of concept trees, where each H_i ∈ H is a concept hierarchy for
the attribute C_i;

(3) S_ti, a soft threshold for attribute C_i with digital characteristics
(Ex_ti, En_ti, He_ti), and d_i, the number of distinct values of attribute C_i;

(4) S_tc, a user-defined soft theoretical complexity threshold with
digital characteristics (Ex_tc, En_tc, He_tc).

Output: the generalized information system S'

S' ← S
Generate soft threshold values S_tc and S_ti
while TC > S_tc and ∃ d_i > S_ti do
    Select an attribute C_i ∈ C such that d_i / S_ti is maximal
    Ascend tree H_i one level and make appropriate substitutions in S'
    Remove duplicates from S'
    Recalculate d_i
    Recalculate TC = ∏_{i=1}^{n} d_i
    Regenerate soft threshold values S_tc and S_ti
Endwhile
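
The loop of Algorithm 2 can be sketched as follows. This is a simplified illustration under several assumptions that are not in the paper: the information system is a list of tuples, each concept tree is a list of child-to-parent dictionaries (one per level), a soft threshold value is drawn from its normal cloud in the manner of a forward generator, and an extra guard stops the loop when no attribute above its threshold can be ascended any further. All names and the toy data are hypothetical.

```python
import random
from math import prod


def soft_value(cloud, rng):
    """Draw one threshold value from a normal cloud (Ex, En, He)."""
    ex, en, he = cloud
    return rng.gauss(ex, abs(rng.gauss(en, he)))


def reduce_complexity(rows, hierarchies, attr_clouds, tc_cloud, seed=None):
    rng = random.Random(seed)
    rows = list(dict.fromkeys(rows))      # S' <- S, duplicate tuples dropped
    n = len(hierarchies)
    level = [0] * n                       # how far each concept tree has been ascended

    def distinct():
        return [len({r[i] for r in rows}) for i in range(n)]

    d = distinct()
    while True:
        s_tc = soft_value(tc_cloud, rng)  # (re)generate the soft thresholds
        s_t = [max(soft_value(c, rng), 1e-9) for c in attr_clouds]
        over = [i for i in range(n)
                if d[i] > s_t[i] and level[i] < len(hierarchies[i])]
        if prod(d) <= s_tc or not over:   # loop condition of Algorithm 2 plus the guard
            return rows
        i = max(over, key=lambda j: d[j] / s_t[j])  # attribute furthest above threshold
        parent = hierarchies[i][level[i]]
        rows = [r[:i] + (parent[r[i]],) + r[i + 1:] for r in rows]  # ascend one level
        rows = list(dict.fromkeys(rows))  # remove duplicates from S'
        level[i] += 1
        d = distinct()                    # recalculate d_i; TC is prod(d)


hierarchies = [
    [{"21": "young", "25": "young", "47": "middle-age", "52": "middle-age", "70": "old"},
     {"young": "ANY", "middle-age": "ANY", "old": "ANY"}],
    [{"2h": "short", "3h": "short", "12h": "long", "20h": "long"},
     {"short": "ANY", "long": "ANY"}],
]
rows = [("21", "2h"), ("25", "12h"), ("47", "3h"),
        ("52", "20h"), ("70", "12h"), ("25", "2h")]
result = reduce_complexity(rows, hierarchies,
                           attr_clouds=[(3.5, 0.1, 0.01), (2.5, 0.1, 0.01)],
                           tc_cloud=(8.0, 0.2, 0.02), seed=0)
print(result)
```

In this toy run both attributes are generalized by one level, from raw values to qualitative terms, after which the theoretical complexity (3 x 2 = 6) falls below the soft threshold of about 8 and the loop stops.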

4.3 Identifying Interacting Attributes

The local discovery of interacting attributes has been reported in [11]. All condition
attributes are grouped into disjoint clusters without considering the decision
attribute(s). Each cluster contains attributes that are directly or indirectly dependent
upon each other. The global discovery method selects a subset of condition attributes
based on their relevance to the decision attribute(s). Here, we adopt the global
generalization algorithm for attribute clusters in [3].

4.4 Search for Classifications 

After reducing complexity and identifying interacting attributes, we search for
credible classifications of the database tuples based on some selected interacting
attributes. A classification is credible if it is complete or almost complete with respect
to the domain from which the database was collected. Here, we adopt the
SLIQ (Supervised Learning In Quest) method, which was developed by Mehta et
al. [4]. It is a supervised learning method that constructs decision trees from a set of
examples. It uses a novel pre-sorting technique in the tree growing phase. This sorting
procedure is integrated with a breadth-first tree growing strategy to enable
classification of disk-resident datasets. SLIQ also uses a new tree-pruning algorithm
that is inexpensive and results in compact and accurate trees. Since we can
interchange between qualitative and quantitative representation, the supervising
process is much easier, and the results are robust. The combination of these
techniques enables it to scale to data sets with many attributes and to classify data
sets irrespective of the number of classes, attributes, and examples.

4.5 Experiments and Results 

In this section, we discuss the domain of acute abdominal pain, focusing on the
models used for the diagnosis, which will test and verify our model and algorithms.
The most serious common cause of acute abdominal pain is appendicitis, and in many
cases a clear diagnosis of appendicitis is difficult, since other diseases such as
Non-Specific Abdominal Pain (NSAP) can present similar signs and symptoms
(findings). The tradeoff is between the possibility of an unnecessary appendectomy
and a perforated appendix, which increases mortality rates five-fold. The high
incidence of acute abdominal pain, coupled with the poor diagnostic accuracy, makes
any improvement in diagnostic accuracy significant.

The abdominal pain data used for this study consists of 10270 cases, each with 169 
attributes. The class variable, final diagnosis, has 19 possible values, and the variables 
have a number of values ranging from 2 to 32 values. The resulting database 
addresses acute abdominal pain of gynaecological origin, based on case-notes for 
patients of reproductive age admitted to hospital, with no recent history of abdominal 
or back pain. In compiling the database, the first 202 cases were used in the design of 
the database itself; thus, they cannot be used for the purpose of testing any model. 
Moreover, out of the 10270 cases, the diagnosis of only 8950 cases was definitely 
known (definite diagnoses); the remaining 1320 cases were assigned the best possible 
diagnosis, as a presumed diagnosis. Finally, 120 patients occur more than once in the 
database. 

Table 1. Classification Results for Acute Abdominal Pain

Class Variable              Cloud Models Method    C4.5    Expert Diagnosis
Appendicitis                3707                   3269    3770
Stomach disease             3025                   2750    3108
Liver diseases              636                    567     669
Spleen diseases             304                    288     310
Gallbladder diseases        247                    243     235
Small intestine diseases    236                    235     240
Large intestine diseases    225                    229     224
Uterus diseases             221                    220     212
Kidney diseases             211                    199     214
Gallstone                   163                    167     180
Duodenitis                  118                    139     145
Colonitis                   159                    168     165
Caecitis                    138                    150     156
Rectitis                    166                    184     187
Alimentary intoxication     134                    141     145
Acid intoxication           77                     84      87
Pancreatitis                54                     61      68
Intimitis                   69                     82      83
Other diseases              380                    1094    72



Our results show that 2 of the 19 classes accounted for almost 67% of the cases,
whereas each of the other classes accounted for 7% or less of the cases. For each of
the 2 most common classes, since the probability distribution was induced from many
cases, our model was significantly better than the C4.5 method (see Table 1),
correctly classifying about 89% of the cases.

On the other hand, on the cases involving the other 17 classes, the C4.5 classifier
performed better than the cloud models approach, although not significantly better.
This is because the cloud models could not accurately estimate the complicated
distributions from so few cases, leading to poor predictive accuracy.

These results offer some insights into the cloud models. In complex domains with
many attributes, such as the abdominal pain domain, feature selection may play a very
important part in classifiers for diagnosis; this is especially true when the data set is
relatively small. In such cases, it is difficult to accurately acquire classification rules
for the larger data set. Moreover, in domains where there are sufficient cases (as for
the two main classes in the abdominal pain data set), the cloud models method
performs very well, since it can easily model attribute dependencies. However, if the
number of cases is small, then the simple decision tree method may perform better.

5 Conclusion 

Data classification is a well-recognized operation in the data mining and knowledge
discovery research field, and it has been studied extensively in the statistics and
machine learning literature. We described a novel approach to search for domain
classification. The goal of the search is to find a classification or classifications that
jointly provide a good approximation, in the sense of qualitative terms, of the concept
of interest. We presented a new mathematical representation of qualitative concepts,
Cloud Models. With the models, mapping between quantities and qualities becomes
much easier and interchangeable. Based on them, we introduced the concepts of the
soft threshold and the concept tree with qualitative terms. We also developed
algorithms for cloud generation, complexity reduction, and identifying interacting
attributes. After the classification search, further steps in the qualitative approach to
knowledge discovery involve classification analysis and simplification, rule induction
and prediction, if required by an application. These aspects require a lot of work and
have been omitted here, as they will be presented in detail in other publications.


Abstract. Clustering geo-referenced data with the medoid method is
related to k-MEANS, with the restriction that cluster representatives
are chosen from the data. Although the medoid method in general
produces clusters of high quality, it is often criticised for the Ω(n^2) time
that it requires. Our method incorporates both proximity and density
information to achieve high-quality clusters in O(n log n) expected time.
This is achieved by fast approximation to the medoid objective function
using proximity information from Delaunay triangulations.



1 Introduction 

Geographic data and associated data sets are now being generated faster than 
they can be meaningfully analysed. Spatial Data Mining [7,9,10,13,15,16] aims 
at benefiting from this information explosion. It provides automated and semi- 
automated analysis of large volumes of spatial information associated with a GIS 
and the discovery of interesting, implicit knowledge in spatial databases. 

Central to spatial data mining is clustering, which seeks to identify subsets 
of the data having similar characteristics. Clustering allows for generalisation 
analyses of the spatial component of the data associated with a GIS [20]. 
Clustering of large geo-referenced data sets has many applications, and has recently 
attracted the attention of many researchers [6,8,10,20,26,27,28]. The volume of 
the data forces clustering algorithms for spatial data mining to be as fast as pos- 
sible at no expense to the quality of the clustering. They must be especially fast 
if the knowledge discovery process is to be an exploratory type of analysis [22], 
where many potential hypotheses must be constructed and tested. 

In spatial settings, the clustering criteria almost invariably make use of some 
notion of proximity, as it captures the essence of spatial autocorrelation and 
spatial association. Proximity is of crucial importance, and is usually evaluated 
by geo-referenced distances. Bottom-up approaches, in which clusters are formed 
by composition of items which are ‘close’ together, accord with the view that 
in geographical application areas, nearby items have more influence upon each 
other. An example of such an agglomerative approach is DBSCAN [8]. 

However, top-down approaches to clustering are also of interest. The top- 
down perspective defines clustering as partitioning a heterogeneous data set 
into smaller, more homogeneous groups. This amounts to identifying regions of 
elements which are very different from others in the set. The classical example 
of this is k-MEANS, which partitions the data by assigning each data point to 
a representative. The partition optimises a statistical homogeneity criterion — 
namely, the total expected dissimilarity between an item and its representative. 

We present a medoid-based clustering method that incorporates both proximity 
and density information. Medoid-based clustering [10,11,14,20] is similar 
to mean-based clustering, except in that it considers only data points as rep- 
resentative points. The medoid-based approach was introduced to spatial data 
mining [20] with CLARANS (a random hill-climber strategy). CLARANS re- 
quired quadratic time per iteration [20,8] and produced poor partitions [11,19]. 
Since then, medoid-based methods have been introduced which generally produce 
clusters of higher quality than k-MEANS, and which are robust to outliers, 
robust to noise and work well with random initialisation [11]. Nevertheless, 
there are two main criticisms of medoid-based clustering: if the number n of 
data points is large, it typically requires O(n²) time for some hill-climbing 
iterations, whereas k-MEANS requires only O(n) time; also, as with k-MEANS, the 
number of clusters k must be specified in advance. Recently, more efficient and effective 
hill-climbers have been developed for the medoids approach in the context of 
spatial data mining [10]. We present further improvements upon these methods. 

The medoid-based clustering algorithm that we present requires only ex- 
pected 0(n log n) time, and does not require specification of the number of 
clusters in advance. Moreover, we achieve this by incorporating proximity in- 
formation using the Voronoi diagram [21] of the data points. The end result is a 
partitioning method for clustering that incorporates density and proximity in- 
formation as in the agglomerative approaches. To our knowledge, this is the first 
time the bottom-up and top-down philosophies have been combined in this way 
for clustering. The algorithm is competitive with recently developed proximity 
methods for clustering, but improves in several aspects: in particular, there is 
no need to specify parameters such as the number or density of clusters. 

The organisation of this paper is as follows. The details of the medoid-based 
approach are presented in Section 2. Section 3 discusses the use of Voronoi 
diagrams in developing a fast algorithm. We also show how the method can 
automatically determine the number of clusters into which the data should be 
grouped. Section 4 is devoted to experimental results, in which our method is 
compared to several others. We conclude with some final remarks in Section 5. 



2 The k-MEDOIDS Problem 

By starting with a random clustering and iteratively improving it via 
hill-climbing, the k-MEANS method approximately optimises the following problem: 

$$\text{minimise} \quad M(C) = \sum_{i=1}^{n} w_i \, d(s_i, \mathrm{rep}[s_i, C]), \qquad (1)$$

where 

• $S = \{s_1, \ldots, s_n\} \subset \mathbb{R}^m$ is a set of n data items in m-dimensional real space; 

• the weight $w_i$ may reflect the relevance of the observation $s_i$, and the distance 
$d(s_i, s_j)$ may be a metric such as one of the Minkowski distances; 

• $C = \{c_1, \ldots, c_k\}$ is a set of k centres, or representative points of $\mathbb{R}^m$; and 

• $\mathrm{rep}[s_i, C]$ is the closest point in C to $s_i$. 

The partition into clusters is defined by assigning each $s_i$ to its representative 
$\mathrm{rep}[s_i, C]$. Those data items assigned to the same representative are deemed to 
be in the same cluster; thus, the k centres encode the partition of the data. 
Typically, m = 2 (the points are two-dimensional coordinates of GIS data), and 
the Euclidean distance metric (or its squared value) is used. 
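
As a minimal sketch of Equation (1), assuming Euclidean distance and unit weights by default, the following Python fragment evaluates M(C) and the partition induced by a set of centres. The function names `rep`, `objective` and `partition` are illustrative choices and do not come from the paper.

```python
import math

def rep(point, centres):
    """Return the centre in `centres` closest to `point` (Euclidean distance)."""
    return min(centres, key=lambda c: math.dist(point, c))

def objective(points, centres, weights=None):
    """M(C): weighted sum of distances from each point to its representative."""
    if weights is None:
        weights = [1.0] * len(points)  # assumption: unit weights by default
    return sum(w * math.dist(s, rep(s, centres)) for s, w in zip(points, weights))

def partition(points, centres):
    """Group the points by their representative centre."""
    clusters = {c: [] for c in centres}
    for s in points:
        clusters[rep(s, centres)].append(s)
    return clusters

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centres = [(0.0, 0.0), (10.0, 0.0)]
print(objective(points, centres))
print(partition(points, centres))
```

Evaluating the objective this way costs O(kn) distance computations, because every point is compared with every centre.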



The k-MEDOIDS approach [14,20] attempts to solve the centres problem 
shown in Equation (1), but with the added restriction that $C \subset S$. This version 
of the combinatorial problem is well known to the operations research community 
as the p-median problem (the operations research literature uses p instead 
of k for the number of groups). The p-median problem cannot be expected to be 
solved optimally for the large number of observations that knowledge discovery 
applications typically involve. The problem is NP-hard, and many heuristic ap- 
proaches have been attempted to obtain approximate solutions for moderately 
large n [4,19]. A heuristic by Teitz and Bart [24] has been remarkably successful 
in finding local optima of high quality, in applications to facility location prob- 
lems [4,18,23], and (after some improvements) very accurate for the clustering 
of large sets of low-dimensional spatial data [10], even in the presence of noise 
or outliers. We will refer to this heuristic as TaB. 



TaB starts at a randomly-chosen solution $C_0$. Letting $C_t$ be the current node 
at time step t, the TaB search is organised in such a way that, in an amortised 
sense, only a constant number of neighbours of $C_t$ are examined for the next 
interchange [10]. When searching for a profitable interchange, it considers the 
points in turn, according to a fixed circular ordering $(s_1, s_2, \ldots, s_n)$ of the points. 
If point $s_i$ already belongs to the medoid set, the point is ignored, and the turn 
passes to the next point in the circular list, $s_{i+1}$ (or $s_1$ if i = n). If $s_i$ is not 
a medoid point, then it is considered for inclusion in the medoid set. The most 
advantageous interchange of non-medoid $s_i$ and medoid $s_j$ is determined, over 
all possible choices of $s_j \in C_t$. If $C' = \{s_i\} \cup C_t \setminus \{s_j\}$ is better than $C_t$, then $C'$ 
becomes the new current solution $C_{t+1}$; otherwise, $C_{t+1} = C_t$. In either case, 
the turn then passes to the next point in the circular list. If a full cycle through 
the set of points yields no improvement, a local optimum has been reached, and 
the search halts. To compute $M(C')$ on a solution $C'$ requires O(n) time: 
O(n) steps to find $\mathrm{rep}[s, C']$ for all $s \in S$ ($\mathrm{rep}[s, C']$ is either unchanged or the new 
medoid $s_i$), and O(n) to compute $M(C')$ as defined in Equation (1). Therefore, 
the time required to test the medoids of $C_t$ for replacement by $s_i$ is O(kn). In 
most situations, $k \ll n$, and thus the test can be considered to take linear time. 
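
The interchange loop described above can be sketched as follows. This is only an illustrative reading of the description, under stated assumptions: unit weights, Euclidean distance, medoids held as indices into the point list, and M(C) recomputed from scratch for every trial swap (so each test is slower than the O(kn) bound quoted above). The names `cost` and `tab` are hypothetical.

```python
import math

def cost(points, medoids):
    """M(C) with unit weights; `medoids` holds indices into `points`."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def tab(points, initial_medoids):
    """Cycle through the points, swapping a point in whenever that lowers M(C)."""
    medoids = list(initial_medoids)
    best = cost(points, medoids)
    n = len(points)
    i = 0
    turns_without_gain = 0
    while turns_without_gain < n:          # stop after one full fruitless cycle
        improved = False
        if i not in medoids:
            # Most advantageous interchange of non-medoid i with some medoid.
            trials = [medoids[:pos] + [i] + medoids[pos + 1:]
                      for pos in range(len(medoids))]
            trial_cost, trial = min((cost(points, t), t) for t in trials)
            if trial_cost < best:
                medoids, best = trial, trial_cost
                improved = True
        turns_without_gain = 0 if improved else turns_without_gain + 1
        i = (i + 1) % n                     # pass the turn along the circular list
    return sorted(medoids), best

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
print(tab(points, [0, 1]))
```

Because every accepted swap strictly lowers M(C) and there are finitely many medoid sets, the loop always terminates at a local optimum.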

3 Incorporating Proximity and Improving Efficiency 



Although the TaB heuristic is faster than all other known hill-climbers [4], it 
requires at least Ω(n²) time to locate a local optimum for the k-MEDOIDS problem 
with n data points. To terminate, TaB requires a pass through the data set in 
which each potential exchange is shown not to improve the value of M(C) [4]. 
In this section, we present a variant of the k-MEDOIDS approach that requires 
only O(n log n) time to terminate. The fundamental idea is that the clustering 
measure M(C) for k-MEDOIDS is a guide to clustering only; its optimisation 
is only the means towards that goal. Although evaluating M(C) exactly requires 
O(kn) time for any given C, we could reduce the complexity of TaB from Ω(n²) 
to O(un) time, if M(C) can be well approximated in O(un) time for some u. 

How can M(C) be approximated in time sublinear in n if it is a sum of n 
terms? The idea is to consider only the most important contributions to the sum, 
according to what M(C) is meant to measure about a clustering. For each cluster, 
M(C) totals the weighted discrepancies between the points in the cluster and 
their representative. Minimising M(C) is equivalent to minimising M(C)/W, 
where $W = \sum_{i=1}^{n} w_i$ is the sum of all weights. The relative importance $w_i/W$ of 
point $s_i$ can be interpreted as a probability, perhaps as that of a query for the 
representative of point $s_i$. The minimisation of M(C) can be viewed as an attempt 
to minimise the expected distance between a point and its representative. 
Note also that if $K_j$ is the cluster represented by a medoid $\mathrm{rep}[K_j]$, then 

$$\frac{M(C)}{W} = \sum_{j=1}^{k} \sum_{s_i \in K_j} \frac{w_i}{W} \, d(s_i, \mathrm{rep}[K_j]).$$

Here, M(C)/W consists of k terms, each measuring the expected discrepancy 
(lack of homogeneity) within a cluster. 

The purpose of clustering is to identify subsets, each of whose points are 
highly similar to one another. However, the greatest individual contributions 
to that portion of M(C) associated with a cluster $K_j$ are made by outliers 
assigned to $K_j$, points which exhibit the least similarities to other points, and 
which generally should not be considered to be part of any cluster. To eliminate 
the contributions of outliers towards the expected discrepancy within clusters, 
we estimate the expected discrepancy amongst non-outlier points only. To do 
this, we limit the contributions to M(C) to those points which lie amongst the 
closest to the medoids of C. Instead of finding a medoid set which best represents 
the entire set of points S, we propose that a medoid set be found which best 
represents the set of points in its own vicinity. 

In order to be able to efficiently determine the set of those points in the 
vicinity of a set of medoids, we preprocess the full set of n points as follows: 

1. For each point $s_i \in S$, we find u points that rank highly amongst the nearest 
neighbours of $s_i$. 

2. Using this information, we construct a proximity directed graph G = (S, E) 
of regular out-degree u and with the set of n points $s_1, \ldots, s_n$ as nodes. The 
edge set E consists of those pairs of nodes $(s_i, s_j)$ for which $s_j$ is one of the u 
points found to be close to $s_i$ in the previous step. 

The adjacency representation of this regular graph has O(un) size. The value 
of u will be chosen so that $uk \ll n$. 

During the hill-climbing process, whenever TaB evaluates a candidate set $C_t$ 
of medoids, only uk data points will be examined. First, only those points that 
are adjacent in the proximity digraph to some medoid $c_j \in C_t$ are ordinarily 
allowed to contribute to the evaluation of $M(C_t)$. We call this set P(C) of points 
the party of C. Namely, $P(C) = \{ s_i \in S \mid (c_j, s_i) \in E \text{ and } c_j \in C \}$. However, since 
two medoids in $C_t$ may share neighbours in the proximity graph, the situation 
may arise where fewer than uk points are evaluated (i.e. $\|P(C_t)\| < uk$). In 
order for the hill-climbing process not to be attracted to medoid sets where 
fewer than uk points are evaluated, two strategies can be applied to pad the 
number of evaluations out to exactly uk. 

The first strategy fills the quota of uk points by randomly selecting from 
among the remaining points. The second strategy fills the quota from among the 
points of the proximity graph by repeatedly adding the contribution of the point 
which is farthest from its representative medoid, as many times as necessary 
to bring the total number of contributions to M(C) up to exactly uk. More 
precisely, the point chosen is the one that attains the following maximum: 

$$\max_{\{j \,\mid\, c_j \in C_t\}} \; \max_{\{s_i \in P(C_t) \,\mid\, \mathrm{rep}[s_i, C_t] = c_j\}} d(s_i, c_j).$$

In our implementations, we have opted for the latter strategy to assure conver- 
gence. Unlike the former strategy, the latter is deterministic, and preserves the 
hill-climbing nature of TaB. 

Using the proximity digraph allows us to obtain an approximation M'(C) to 
M(C) in O(uk) time. Note that our approach does not restrict the candidates for 
a swap in the search by TaB, but rather it achieves its time savings by effectively 
simplifying the graph for the p-median problem that TaB must solve. Restricting 
swaps in TaB on the basis of their distance is common in the statistical literature, 
but has been shown to result in solutions of poor quality [19]. 
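
A rough sketch of the approximation M'(C) with the second (deterministic) padding strategy is given below. It assumes unit weights, Euclidean distance, and a precomputed proximity digraph stored as a dictionary `neighbours` mapping each point index to its u out-neighbours; these names and the function `approx_objective` are hypothetical. For brevity the representative of each party member is found by brute force over the k medoids, so this sketch runs in O(uk²) rather than the O(uk) stated in the text.

```python
import math

def approx_objective(points, neighbours, medoids, u):
    """Approximate M(C) using only the party P(C), padded to u*k contributions."""
    party = set()
    for c in medoids:
        party.update(neighbours[c])          # points adjacent to some medoid
    contributions = [
        min(math.dist(points[s], points[c]) for c in medoids)
        for s in party
    ]
    quota = u * len(medoids)
    if contributions and len(contributions) < quota:
        # Padding: repeat the contribution of the party point that is
        # farthest from its representative until the quota is met.
        farthest = max(contributions)
        contributions += [farthest] * (quota - len(contributions))
    return sum(contributions)

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
neighbours = {0: [1, 2], 1: [0, 2], 2: [1, 0], 3: [4, 2], 4: [3, 2]}
print(approx_objective(points, neighbours, [1, 3], u=2))
```

In the example, medoids 1 and 3 share neighbour 2, so the party holds only three points and one padded contribution is added to reach uk = 4.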

We complete the description of our approach with the details of the proposed 
preprocessing. Given a set of data points $S = \{s_1, \ldots, s_n\}$ in the plane, the 
Voronoi region of $s_i \in S$ is the set of points (not necessarily data points) which 
have $s_i$ as a nearest neighbour; that is, $\{ x \in \mathbb{R}^2 \mid \forall j \neq i, \ d(x, s_i) \le d(x, s_j) \}$. 

Taken together, the n Voronoi regions of S form the Voronoi diagram of S (also 
called the Dirichlet tessellation or the proximity map). 

The Delaunay triangulation D(S) of S is a planar graph embedding defined as 
follows: the nodes of D(S) consist of the data points of S, and two nodes $s_i$, $s_j$ 
are joined by an edge if the boundaries of the corresponding Voronoi regions 
share a line segment. Delaunay triangulations capture in a very compact form 
the proximity relationships among the points of S. Fig. 1 (a) shows the Delaunay 
triangulation of a set of 100 data points. Delaunay triangulations have many 
useful properties [21], the most relevant to our goals being the following: 

1. If $s_i$ is the nearest neighbour of $s_j$ from among the data points of S, then 
$(s_i, s_j)$ is an edge in D(S). 

2. The number of edges in D(S) is at most 3n - 6. 

3. The average number of neighbours of a site $s_i$ in D(S) is less than 6. 

4. The circumcircle of a triangle in D(S) contains no other point of S. 





Fig. 1. (a) A data set and its Delaunay triangulation, (b) Clustering produced 
using a proximity graph derived from the Delaunay triangulation. 



For each data point $s_i \in S$, consider the ordering of the points of $S \setminus \{s_i\}$ 
according to the lengths of their shortest paths to $s_i$ within D(S), from smallest 
to largest. Here, the length of the path is taken to be the sum of the lengths of its 
constituent edges in the planar embedding, according to the distance metric d. 
We claim that the first u sites in this ordering rank very highly amongst the 
u-nearest neighbours of $s_i$. When u = 1, this is certainly true, due to the first 
property of Delaunay triangulations stated above. But even for larger values 
of u, the orderings are still strongly correlated. 

Since the triangulation D(S) can be robustly computed in O(n log n) time, 
this structure offers us a way to compute a proximity graph of high quality in 
subquadratic time. For each medoid point $s_i \in S$, we choose for inclusion in the 
party P(C) its u-nearest neighbours, according to shortest-path distance within 
D(S). The adjacency list for each $s_i$ can be found by means of a modified 
single-source shortest-path search from $s_i$, stopping when u distinct neighbours are 
discovered. The total time required to find u neighbours for all $s_i$ is in O(un). 
Setting u = Θ(log n) allows the proximity graph to be constructed in O(n log n) 
total time. Thereafter, each evaluation of M(C) will take Θ(log n) time. 
Although in theory, intermediate hill-climbing steps can involve a linear number 
of evaluations of M(C), they typically involve only about four full TaB cycles 
through the set of medoid candidates, each cycle taking Θ(n log n) time. The 
termination step itself will take only O(n log n) time. 
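
The truncated shortest-path search can be sketched as below. The sketch assumes that the Delaunay edges have already been computed by some computational-geometry routine and are passed in as index pairs; it then runs Dijkstra's algorithm from each point and stops once u neighbours have been settled. The function name `proximity_graph` and the small hand-built example are illustrative only.

```python
import heapq
import math
from collections import defaultdict

def proximity_graph(points, edges, u):
    """For each point, return its u nearest points by shortest-path length."""
    adj = defaultdict(list)
    for a, b in edges:                       # undirected, Euclidean edge lengths
        w = math.dist(points[a], points[b])
        adj[a].append((b, w))
        adj[b].append((a, w))

    graph = {}
    for src in range(len(points)):
        dist = {src: 0.0}
        settled = set()
        heap = [(0.0, src)]
        found = []
        while heap and len(found) < u:       # stop early after u neighbours
            d, v = heapq.heappop(heap)
            if v in settled:
                continue
            settled.add(v)
            if v != src:
                found.append(v)
            for nb, w in adj[v]:
                nd = d + w
                if nd < dist.get(nb, math.inf):
                    dist[nb] = nd
                    heapq.heappush(heap, (nd, nb))
        graph[src] = found
    return graph

# Four points whose Delaunay triangles are (0, 1, 2) and (1, 2, 3).
points = [(0, 0), (4, 0), (0, 3), (5, 4)]
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
print(proximity_graph(points, edges, u=2))
```

The resulting dictionary has the shape assumed for the proximity digraph: one adjacency list of length u per point.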

Fig. 1 (b) shows the clustering obtained with our algorithm using k = 10. 
The positions of the 10 medoids are indicated with the © symbol. We have found 



Robust Clustering of Large Geo-referenced Data Sets 333 



that our approach does not sacrifice the quality of the clustering. As with 
k-MEANS, medoid-based clustering uses the number k of groups as a parameter 
supplied by the user. Arguably k is an easier parameter to estimate than the 
density parameters of DBSCAN and STING, and slightly overestimating k is 
usually not a problem with medoid approaches: whenever k is increased by one, 
it usually splits one cluster into two subclusters, or places the extra medoid at 
an outlier. In contrast, k-MEANS typically adjusts many clusters. 

Nevertheless, the knowledge discovery process should obtain a robust suggestion 
for k. Algorithms minimising M(C) do not search for k for two fundamental 
reasons: (1) the motivation for solving the p-median problem seems to come from 
a facility location application to locate p = k given facilities (and not to find 
clusters), and (2) as k increases, M(C) monotonically decreases. Minimising M(C) 
while allowing k to vary results in k = n: every point is a cluster. 

However, the use of D(S) can rapidly suggest an initial value for k. In fact, 
the proximity information of its dual, the Voronoi diagram, is implicitly encoded 
in the Delaunay triangulation, making D(S) a useful structure for clustering in 
space [5]. Consider the Delaunay triangulation in Fig. 1. Note that the perimeter 
lengths of the Delaunay triangles seem either large or small, with few in between. 
If three points $s_i$, $s_j$, $s_k$ form a Delaunay triangle, the circle passing through 
them is empty of other points of S, and is therefore likely to be small whenever 
the points belong to a common cluster [21]. The bounding circle in turn limits 
the perimeter of the triangle, and therefore triangles spanning three points in a 
common cluster are likely to have smaller perimeter (but not necessarily smaller 
area) than those spanning points from different clusters. We examine the 
distribution of perimeter lengths among the 2n - 4 triangles, and find a value to 
discriminate between large and small perimeter lengths. We then classify 
triangles as either large or small. All triangles of small perimeter are selected, and 
those selected triangles sharing bounding edges are aggregated. The number of 
contiguous ‘patches’ or aggregations is the suggestion for k. We have found that 
a very large range of discriminant values partitions the triangles into small and 
large in a way that leads to high-quality suggestions for the number k of groups. 
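
One way to sketch the patch-counting step is shown below. It assumes the triangles are supplied as index triples (for example from a Delaunay routine) and that the perimeter threshold is given by the caller; how the threshold is read off the perimeter distribution is not specified here. Small triangles sharing an edge are merged with a union-find structure, and the number of resulting groups is returned. The name `suggest_k` is hypothetical.

```python
import math

def suggest_k(points, triangles, threshold):
    """Count contiguous patches of triangles whose perimeter is below threshold."""
    def perimeter(tri):
        a, b, c = (points[i] for i in tri)
        return math.dist(a, b) + math.dist(b, c) + math.dist(c, a)

    small = [t for t in triangles if perimeter(t) < threshold]
    parent = list(range(len(small)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    edge_owner = {}                          # edge -> first small triangle seen
    for idx, (a, b, c) in enumerate(small):
        for edge in ((a, b), (b, c), (c, a)):
            key = tuple(sorted(edge))
            if key in edge_owner:
                parent[find(idx)] = find(edge_owner[key])   # shared edge: merge
            else:
                edge_owner[key] = idx
    return len({find(i) for i in range(len(small))})

# Two unit squares far apart, joined by two long thin triangles (hand-written).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (11, 0), (10, 1), (11, 1)]
triangles = [(0, 1, 2), (1, 2, 3), (4, 5, 6), (5, 6, 7), (1, 3, 4), (3, 4, 6)]
print(suggest_k(points, triangles, threshold=5.0))
```

With a threshold of 5.0 the two long connecting triangles are discarded, leaving one patch per square.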

4 Comparison and Discussion 

The first agglomerative algorithm to require O(n log n) expected time is 
DBSCAN [8]. The algorithm is regulated by two parameters, which specify the 
density of the clusters. The algorithm places the data points in an R*-tree, and 
uses the tree to perform u-nearest-neighbour queries (usually u = 4) to achieve 
the claimed performance in an amortised sense. An extra O(n log n) expected 
time is taken in helping the users determine the density parameters, by sorting 
all distances between a point and its 4-nearest neighbours, and finding a valley 
in the distribution of these distances. It has been found [26,27] that determining 
the parameters of DBSCAN is difficult for large spatial databases. An alterna- 
tive that does not require input parameters is DBCLASD [27]. Like DBSCAN, 
DBCLASD can find clusters of arbitrary shape, not necessarily convex; however, 
DBCLASD is significantly slower than DBSCAN. The clusters produced 
by DBSCAN and our approach seem to be of equal quality. Our approach shares 
with DBSCAN the interesting feature that it does not require any assumptions 
or declarations concerning the distribution of the data. We have also found that 
both require roughly the same computational time, once the parameters of 
DBSCAN have been discovered. Otherwise, DBSCAN must continually ask for 
assistance from the user. The reliance of DBSCAN on user input can be 
eliminated using our approach, by using the density of the patches formed by small 
Delaunay triangles as the density parameter for DBSCAN. This ultimately 
improves DBSCAN, which originally was intended to use a density parameter for 
each cluster, and not a global user-supplied parameter [8]. 

Imposing a grid on the data results in another approach [3,26,28]. Intuitively, 
when a grid is imposed on the data, those grid boxes containing a large number 
of data points would indicate good candidates for clusters. The difficulty for the 
user is in determining the granularity of the grid. Maximum entropy discretiza- 
tion [3] allows for the automatic determination of the grid granularity, but the 
size of the grid grows quadratically in the number of points. Later, the BIRCH 
method saw the introduction of a hierarchical structure for the economical stor- 
age of grid information [28]. The recent STING method [26] combines aspects 
of these two approaches. In two dimensions, STING constructs a hierarchical 
data structure whose root covers the region of analysis. 

In a fashion similar to that of a quadtree, each region has 4 children rep- 
resenting 4 sub-regions. However, in STING, all leaves are at equal depth in 
the structure, and all leaves represent areas of equal size in the data domain. 
The structure is built by finding information at the leaves and propagating it to 
the parents according to arithmetic formulae; for example, the total number of 
points under a parent node is obtained by summing the total number of points 
under each of its children. When used for clustering, the query proceeds from 
the root down, eliminating branches from consideration because of distribution 
criteria. As only those leaves that are reached are relevant, the data points under 
these leaves can be agglomerated. It is claimed that once the search structure is 
in place, the time taken by STING to produce a clustering will be sublinear. As 
we indicated earlier, determining the depth of the structure (or the granularity 
of the grid) is a challenge. STING is a statistical parametric method. It assumes 
the data is a mixture model and works best with knowledge of the distributions 
involved. These conditions favour methods such as Expectation Maximisation, 
Autoclass [2], Minimum Message Length (MML) [25] and Gibbs sampling. Recall 
that DBSCAN and our approach are non-parametric. 
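
The bottom-up propagation of counts from leaves to parents can be illustrated with the sketch below. It is not the STING implementation: it assumes a square grid of 2^depth by 2^depth equal-sized leaf cells over a caller-supplied, non-degenerate bounding box, and it propagates only point counts, whereas the text notes that other information is propagated by similar arithmetic formulae. The name `count_pyramid` is hypothetical.

```python
def count_pyramid(points, depth, extent):
    """Return grids of counts, root first; level d is a 2^d x 2^d grid."""
    x0, y0, x1, y1 = extent
    side = 2 ** depth
    leaves = [[0] * side for _ in range(side)]
    for x, y in points:
        # Clamp so that points on the upper boundary fall in the last cell.
        col = min(int((x - x0) / (x1 - x0) * side), side - 1)
        row = min(int((y - y0) / (y1 - y0) * side), side - 1)
        leaves[row][col] += 1

    levels = [leaves]
    while len(levels[0]) > 1:
        child = levels[0]
        half = len(child) // 2
        # Each parent count is the sum of the counts of its 4 children.
        parent = [[child[2 * r][2 * c] + child[2 * r][2 * c + 1]
                   + child[2 * r + 1][2 * c] + child[2 * r + 1][2 * c + 1]
                   for c in range(half)] for r in range(half)]
        levels.insert(0, parent)
    return levels

levels = count_pyramid([(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)], depth=2,
                       extent=(0.0, 0.0, 1.0, 1.0))
print(levels[0], levels[1])
```

All leaves sit at the same depth and cover equal areas, which matches the property that distinguishes this structure from a general quadtree.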

A second problem with STING is that the grid to be imposed on the data 
grows very rapidly with the number of dimensions. Thus, with bidimensional 
points, this limits the number of slabs to O(√n) if linear time and storage are 
desired. This in turn imposes limitations upon the granularity of the cells in the 
grid. The construction of the hierarchical data structure requires Ω(n log n) time 
if there are Ω(n) leaves. Note that if an insertion algorithm works with a tree with 
nodes and pointers, then the construction time is proportional to the external 
path length of the tree. If a linearisation of space is used, then the arithmetic to 
summarise from offspring to parents demands operations with the bit patterns 
of the indices of slabs. This costs Ω(log n) per point. STING construction works 
in linear time if the depth of the structure is restricted to 4 levels [26]. Our 
implementation of STING uses O(n) space, as it limits the number of cells 
at the bottom level to $g = 4^c$, where c is the first integer such that $4^c > n$. 
STING finds clusters that are restricted to orthogonal polygons, and its result 
is a choropletic (two-colour) map where disjoint regions are assumed to indicate 
different clusters. A statistical test on density is made on bottom level cells to 
decide if they are in a cluster or not. In the example of Fig. 1, STING accepts 
only cells with a count of 4 or larger. Thus, only 6 groups are found by STING, 
although the data set has at least 8 obvious clusters. Many points that are clearly 
members of clusters are not labelled as such, because they have been separated 
from their clusters, lying in cells with a low count. This poor clustering reflects 
many of the problems with this type of statistical approach when attempting to 
keep the algorithmic time requirements close to O(n log n). 

STING is very well suited for an OLAP environment and SQL-type queries 
specifying several characteristics on the attributes of geo-referenced records. In 
such settings, an indexing hierarchical structure such as STING's allows for 
the elimination of many sub-nodes on the basis of attribute values, thereby 
arriving at a relatively small number of cells at the bottom level where the 
statistical density tests are performed. In fact, it may be the SQL query itself 
which specifies the minimum density threshold. 

Construction of a dendrogram, a type of proximity tree, allows another hierarchical 
approach for clustering two-dimensional points in O(n log n) time [17]. 
Initially, each data point is associated with its own leaf, and each leaf constitutes 
a cluster with only one member. Iteratively, the pair of nodes representing the 
two closest clusters are joined under a new parent node, and their data points 
together form a new cluster associated with that parent node. The two clusters 
are deleted from the pool of available clusters, and the new merged cluster is 
added to the pool. The process terminates when only one cluster remains. Un- 
fortunately, it is unclear how to use the proximity tree to obtain associations 
from distinct data layers in a GIS [8]. 

There are two classical approaches to finding the number k of groups: 
AUTOCLASS [2] and Minimum Message Length (MML) [25]. Both demand a 
declaration of a probabilistic mixture model. AUTOCLASS [2] searches for the 
classes using a Bayesian statistical technique. It requires an explicit declaration 
of how members of a class are distributed in the data in order to form a 
probabilistic class model. AUTOCLASS uses a variant of Expectation Maximisation, and 
thus, it is a randomised hill-climber similar to k-MEANS or TaB, with additional 
techniques for avoiding local maxima. Similarly, MML methods [25] require the 
declaration of a model with which to describe the data, in two parts. The first part 
is an encoding of parameters of the mixture model; the second is an encoding 
of the data given the parameters. There is a trade-off between the complexity 
of the model and the quality of fit. Also, hard optimisation problems must be 
solved heuristically when encoding parameters in the fewest number of bits. 

For an approach that does not assume a predefined mixture model, Ng and 
Han [20] proposed to run CLARANS once for each k, from 2 to n. For each 
of the discovered clusterings, the silhouette coefficient [14] is calculated. The 
clustering of maximum silhouette coefficient is chosen, determining the number 
of classes. This is prohibitively expensive for large values of n. 
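
For reference, a brute-force sketch of the silhouette coefficient is given below, following the usual definition: for each point, a is the mean distance to the other members of its cluster, b is the smallest mean distance to the members of any other cluster, and the point's silhouette is (b - a) / max(a, b); singleton clusters score 0. Euclidean distance and the function name `silhouette` are assumptions of this sketch. Each call costs O(n²) distance computations, which illustrates why repeating it for every k from 2 to n is expensive.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width of a clustering given as one label per point."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue                          # singleton cluster: silhouette 0
        # The distance from p to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
print(silhouette(points, [0, 0, 1, 1]))   # well-separated grouping
print(silhouette(points, [0, 1, 0, 1]))   # poor grouping
```

The clustering with the larger value would be preferred when comparing candidate values of k.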



5 Final Remarks 

We have presented a clustering method which both exhibits the characteristics 
of density-based clustering (like DBSCAN) and is fully autonomous. It does 
not demand the declaration of a density model from the user, and because it 
uses a medoid-based approach, it is robust to outliers and noise. All this is achieved 
within O(n log n) expected time. 

Recently, the Knowledge Discovery and Data Mining perspective on clustering 
has generally been that of scaling algorithms for a density estimation problem 
solved through k-MEANS [1] or Expectation Maximisation [12]. In particular, 
much effort has been focused on the sensitivity of k-MEANS and Expectation 
Maximisation to the set of representatives used to initialise the search [1]. 

The medoids approach is robust with respect to random initialisation. 
Moreover, researchers have recently identified a set of desiderata for clustering 
methods [1,12] with which our medoid approach complies. Namely, the clustering 
method should be stoppable and resumable, with the capacity to obtain a 
clustering solution at any time, and able to improve on the quality of 
the solution given more computational resources. Also, since clustering is 
central to spatial generalisation, it has been suggested [26] that clustering 
methods should find groups directly from the data; this favours medoids over 
k-MEANS. 

Abstract. Clustering in large databases is an important and useful data 
mining activity. It aims to find meaningful patterns in the data set. 
Some new requirements for clustering have been raised: good 
efficiency for large databases; input parameters that are easy to determine; 
and separation of noise from the clusters [1]. However, conventional clustering 
algorithms can seldom fulfill all these requirements. The notion of 
density-based clustering has been proposed, which satisfies all these requirements 
[1]. In this paper, we present a new and more efficient density-based 
clustering algorithm called FDC. The clustering in this algorithm is defined 
by an equivalence relationship on the objects in the database. The 
complexity of FDC is linear in the size of the database, which is much faster 
than that of the algorithm DBSCAN proposed in [1]. Extensive 
performance studies have been carried out on both synthetic and real data, 
which show that FDC is the fastest density-based clustering algorithm 
proposed so far. 



Keywords: Clustering, Density-based method, Spatial Database 

1 Introduction 

Both the number of databases and the amount of data stored in a database 
have been increasing rapidly in recent years. There is a great deal of useful 
information hidden in the data. Therefore, automatic knowledge discovery in 
databases is becoming more and more important. Clustering, the grouping of 
objects into meaningful classes, is one of the important data mining tasks. 

Data mining has raised new requirements for clustering [1]: 

— Good efficiency for large databases: there may be millions of objects inside the 
database. A clustering algorithm should scale up well to large databases, 
and hence should not be memory-based. 

— Reduced dependence on input parameters: appropriate values for the 
parameters are often difficult to determine. For example, we usually do not 
know a priori the number of clusters. 

— Ability to separate noise from the data: there may exist some points in the 
database that do not belong to any meaningful cluster; the algorithm 
should be able to separate these noise points from the clusters. 

None of the conventional clustering algorithms can fulfill all these
requirements. Recently, some new algorithms have been developed to improve
the efficiency of clustering in large databases, including CLARANS [2,3],
BIRCH [4] and CURE [5]. However, these algorithms are not designed to handle
noise, and many of them need the number of clusters to be assumed before
clustering starts.

A density-based algorithm, DBSCAN, is proposed in [1,6]. It requires the
user to specify two parameters that define the density threshold for
clustering: Eps, the radius of the neighborhood of a point, and MinPts, the
minimum number of points in the neighborhood. Clusters are found by starting
from an arbitrary point: if its Eps-neighborhood contains at least MinPts
points, then all the points in the Eps-neighborhood belong to the same
cluster. The process is repeated on the newly added points until all points
have been processed. Points that cannot be absorbed into any cluster are
noise points. The notion of density-based clusters can fulfill the
aforementioned requirements on clustering for mining useful patterns in
large databases [1]. DBSCAN uses an R*-tree to manage all the points and to
compute the neighborhood of a data point. Its complexity is at least on the
order of N log N, where N is the size of the database. In this paper, we
introduce a new density-based notion of clustering in which the clusters
are equivalence classes of a binary relation defined between objects in the
database. This notion gives a more precise definition than that in [1]. We
also propose an algorithm called FDC to compute density-based clusters. The
complexity of FDC is linear in N, and hence FDC is much faster than DBSCAN.

The rest of the paper is organized as follows. In Section 2, we present our
notion of density-based clustering. Section 3 introduces the algorithm FDC.
The performance studies of FDC, on synthetic as well as real data, are
presented in Section 4, where the results are compared with DBSCAN. Finally,
we conclude the paper in Section 5.



2 A Density-Based Notion of Clusters 

We first give a brief overview of the notion of density-based clustering as defined 
in [1]. Given a set of objects (or points) in a database D, a distance function 
dist on the objects, a neighborhood radius Eps, and a threshold MinPts on the 
minimum number of objects in the neighborhood, the notion of clustering in [1] 
is defined as the following: 

Definition 1. (directly density-reachable) An object p is directly
density-reachable from an object q if

- $p \in N_{Eps}(q)$, where $N_{Eps}(q) = \{t \in D \mid dist(q,t) \le Eps\}$,

- $|N_{Eps}(q)| \ge MinPts$ (core point condition).

Definition 2. (density-reachable) An object p is density-reachable from an
object q if there is a chain of objects $p_1 = q, p_2, \dots, p_n = p$ such
that $p_{i+1}$ is directly density-reachable from $p_i$, $\forall i = 1, \dots, n-1$.

Definition 3. (density-connected) An object p is density-connected to an
object q if $\exists o \in D$ such that both p and q are density-reachable
from o.

A cluster is defined as a set of density-connected objects which is maximal 
with respect to density-reachability. The objects not contained in any cluster are 
the noise. 

Definition 4. (cluster) A cluster C with respect to Eps and MinPts in D is a
non-empty subset of D satisfying the following conditions:

- Maximality: $\forall p, q \in D$, if $q \in C$ and p is density-reachable
from q, then $p \in C$.

- Connectivity: $\forall p, q \in C$, p is density-connected to q.

Based on the above notion of cluster, it can be concluded that a cluster C
in D can be grown from an arbitrary core point in C by absorbing the points
in its Eps-neighborhood into C recursively. DBSCAN is developed based on
this idea.

The density-connected relationship is symmetric but not transitive, while
density-reachable is transitive but not symmetric. The notion of clustering
in Definition 4 is a partitioning of the objects in D with respect to the
relationships density-connected and density-reachable. However, these two
relationships are not compatible with each other. In some cases, there is a
contradiction between the Maximality and Connectivity conditions. In Fig. 1,
there are two clusters of points, $C_1$ and $C_2$, touching at their
boundary. There exists some point p in the boundary area which is
density-reachable from a core point $q_1$ in $C_1$ and from another core
point $q_2$ in the second cluster $C_2$. However, p itself is not a core
point because it does not have enough neighboring points. Let us assume
that there is no chain connecting the points in $C_1$ to those in $C_2$,
i.e., they are not density-connected. In this situation, we would expect p
to belong to both clusters $C_1$ and $C_2$. If the Maximality condition
holds, then $C_1$ and $C_2$ should be merged into one cluster. On the other
hand, the points in $C_1$ are not density-connected to those in $C_2$, hence
they should not be together in one cluster. This anomaly shows that the two
relationships density-reachable and density-connected are not quite
compatible with each other. Algorithmically, DBSCAN would assign the point
p to the cluster which is generated first, and the two clusters are hence
separated at the boundary near p.

[Figure 1: two clusters C1 and C2 touching at their boundary. The point p is
density-reachable from q1 and also from q2, but q1 and q2 are not
density-connected.]

Fig. 1. Clustering Anomaly



We found that the above anomaly can be resolved by defining a new
"connectivity" notion called "density-linked", which is an equivalence
relation. More importantly, we found that density-based clusters can then
be defined as the equivalence classes of this relation. This enhanced
notion of connectivity gives density-based clustering a simple and sound
theoretical base.

To simplify the presentation, we denote the relationship that object p is
directly density-reachable from q by $p \Leftarrow q$. We redefine the
notion of connectivity by the density-linked relationship.

Definition 5 (density-linked) The binary relationship density-linked,
written $\Leftrightarrow$ (with respect to Eps > 0 and MinPts > 1 in set D),
is defined as the following:

- $\forall o \in D$, $o \Leftrightarrow o$;

- $\forall p, q \in D$ with $p \ne q$, $p \Leftrightarrow q$ iff
$\exists\, p = o_1, o_2, \dots, o_n = q \in D$ such that $\forall i = 1, \dots, n-1$,
either $o_i \Leftarrow o_{i+1}$ or $o_{i+1} \Leftarrow o_i$.

From the definition, we can directly deduce the following important lemma:
Lemma 1: The relationship density-linked with respect to given Eps and
MinPts in set D is an equivalence relation.

For a non-empty set D, this equivalence relation determines a unique
partitioning of D. In fact, we can define density-based clustering
by this partitioning.

Definition 6 (cluster & noise) Suppose $\pi = \{C_1, C_2, \dots, C_m\}$ is
the set of equivalence classes on D defined by the density-linked
relationship with respect to Eps and MinPts.

- if $|C_i| > 1$, then $C_i$ is a cluster;

- if $|C_i| = 1$ and $o \in C_i$, then o is a noise.

A cluster C defined above contains at least MinPts points. This is because
C contains at least two points p and q that are adjacent in a density-linked
chain, so by definition either p is directly density-reachable from q or q
is directly density-reachable from p. Hence one of them must be a core point
whose Eps-neighborhood contains at least MinPts points, all of which are in
the cluster C.

Besides the merit that the clustering is defined by a simple equivalence
relation, we have also resolved the above-mentioned anomaly: our definition
allows a cluster to include two (or more) disjoint (not density-connected)
sets of core points which share a thin layer of border points.

Before introducing our clustering algorithm, we present a theorem which
characterises the basic computation in the algorithm. It provides a
criterion to determine the clustering property of every object in D. (The
clustering property of an object is the information of whether it is a
noise or, if not, which cluster it belongs to.)

Theorem 1 Given a data set D and the relationship density-linked with
respect to Eps and MinPts, $\forall o \in D$, its clustering property can be
determined by the following rule:

- if $|N_{Eps}(o)| = 1$, o is a noise;

- if $|N_{Eps}(o)| \ge MinPts$, o is a core point of a cluster, and all
objects in $N_{Eps}(o)$ belong to the same cluster;

- otherwise, let $T = \{p \mid p \in N_{Eps}(o), |N_{Eps}(p)| \ge MinPts\}$;
if $T \ne \emptyset$ then o and all objects in T belong to the same cluster;
else, o is a noise.

Following Theorem 1, we need to compute $N_{Eps}(o)$ for every point o in
the database to determine its clustering property. If either condition 1 or
2 is true, then the point's clustering property has been determined;
otherwise, we can skip the point temporarily. Eventually one of the two
possibilities in case 3 will become true: either the point is found in the
Eps-neighborhood of some core point, or it is a noise. Note that, unlike
DBSCAN, Theorem 1 requires no recursion to compute the clustering property
of the objects. We will show in the next section that a linear algorithm
can be developed to compute all the clusters based on this theorem.
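
The three cases of Theorem 1 can be illustrated with a short Python sketch. It is a brute-force illustration only, not the FDC algorithm: the function names and the string labels are hypothetical, neighborhoods are found by a quadratic scan, and we assume that a neighborhood includes the point itself, uses dist <= Eps, and that the core point condition is |N_Eps| >= MinPts.

```python
import math

def eps_neighborhood(points, o, eps):
    """Indices of all points within distance eps of points[o] (o itself included)."""
    return [i for i, p in enumerate(points) if math.dist(points[o], p) <= eps]

def clustering_property(points, o, eps, min_pts):
    """Apply the three cases of Theorem 1 to points[o].

    Returns ("noise", []), ("core", neighbors) or ("border", core_neighbors).
    The labels are illustrative names, not terms fixed by the paper.
    """
    n_o = eps_neighborhood(points, o, eps)
    if len(n_o) == 1:            # case 1: only o itself is nearby
        return "noise", []
    if len(n_o) >= min_pts:      # case 2: o is a core point
        return "core", n_o
    # case 3: collect the core points inside the neighborhood of o (the set T)
    t = [p for p in n_o if len(eps_neighborhood(points, p, eps)) >= min_pts]
    return ("border", t) if t else ("noise", [])
```

A point labelled "border" here joins the cluster of the core points returned with it; a point for which T is empty is noise even though it has a few neighbors.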



3 The FDC Algorithm for Density-Based Clustering 



In this section we present the efficient algorithm FDC (Fast Density-Based
Clustering) for density-based clustering defined by the density-linked
relationship. The main task is to compute efficiently the Eps-neighborhood
of every point in the database. We assume that the objects are, in total,
too large to be processed together in memory.

FDC is a multi-stage algorithm. It first stores all the data points in a
k-d tree. This partitions the whole data space into a set of rectangular
subspaces, each represented by a leaf node of the tree. We assume that the
subspaces of the k-d tree are small enough to be processed in memory.
Secondly, we search the Eps-neighborhoods of the points in each subspace
(leaf node) separately. Those points whose clustering property can be
determined entirely within the subspace are identified first and removed
from memory. The remaining points are those that are near the split
boundary of the node. In order to find all the neighboring points of these
boundary points, cross boundary searching is performed across the split
boundary at the internal nodes of the k-d tree. Note that after a leaf node
is processed, only a smaller number of boundary points need to be maintained
in memory to support cross boundary searching.

The first step of FDC is to build a k-d tree on the database. As an example,
in Fig. 2, a 2-dimensional data space is partitioned into 7 subspaces
represented by the leaf nodes $n_1, \dots, n_7$ of the corresponding k-d tree.





Fig. 2. K-d tree from FDC 

After building the k-d tree structure, FDC performs a postorder bottom-up
traversal of the tree. At a leaf node, FDC performs a cell-based procedure
NodeClustering to determine the clustering property of all the points in
the node except those that are near the split boundary (boundary points).
When FDC traverses to a non-leaf node, it uses the cross boundary searching
procedure CrossClustering to compute the Eps-neighborhoods of the boundary
points found previously in its two child nodes. Cross boundary searching at
an internal node is performed on the split boundary in the k-d tree. Once
FDC reaches the root node of the k-d tree, the clustering property of all
the points can be determined.

In order to manage the cross boundary searching, we define the cross
clustering order (CCO) of a node. It is an ordering of the boundaries of a
node which determines the order of performing cross boundary searching at
the node. For example, the CCO of node $n_3$ in Fig. 2 is
$\langle L_y, L_x, U_y, U_x \rangle$. (We use $L_x$ and $U_x$ to denote the
lower and upper X-axis boundaries respectively.) In fact, the CCO of a node
is the reverse of the split order when the node is being generated. For
example, node $n_1$ has two split boundaries, and $U_y$ is generated before
$U_x$. Therefore, the CCO of node $n_1$ is $\langle U_x, U_y \rangle$.

We present the algorithm FDC in Fig. 3 and explain some details in the 
following subsections. 

/* Input: D: database; Eps, MinPts: thresholds; Output: Clusters of D. */

1) scan the database D once to build a k-d tree Tr;

   for each point p ∈ D, p.CID = UNCLASSIFIED;

2) initialize the global variable CidSet;

3) traverse Tr in postorder and for each node Node do {

4)    if Node is a leaf node then NodeClustering(Node, Eps, MinPts);

5)    if Node is an internal node then

         CrossClustering(Node, Lbps, Rbps, Eps, MinPts);

      /* Lbps, Rbps are the boundary points found in the two child nodes */

6) }

7) scan D and re-label all the points with cluster ids determined in CidSet;

Fig. 3. Algorithm FDC 

3.1 NodeClustering : a cell-based clustering procedure 

NodeClustering is used to perform clustering in each leaf node. The most
time consuming task is to find the neighbors of all points inside the
subspace of the node. The simplest way is to compute the distance between
every pair of points in the node, but this requires a number of distance
computations that is quadratic in the number of points inside the node. FDC
instead uses a cell-based approach whose time complexity is linear in the
number of points.

NodeClustering (see Fig. 4) first divides the subspace into equal-sized
rectangular cells whose length in each dimension is equal to Eps. All data
points in the subspace are mapped into these cells. Since we can control the
size of the leaf node, it is reasonable to assume that both the data set of
the node and the multi-dimensional cell array fit into main memory.

Let $C(c_1, c_2, \dots, c_d)$ be a cell, where $c_i$, $i = 1, \dots, d$, are
the cell coordinates and i is the dimension index. We use
$NC(C(c_1, c_2, \dots, c_d)) = \{C(a_1, a_2, \dots, a_d) \mid a_i \in \{c_i - 1, c_i, c_i + 1\}, i = 1, \dots, d$;
and $C(a_1, \dots, a_d)$ does not exceed the boundary of the node$\}$
to denote the set of immediate neighboring cells of C. A typical
non-boundary cell C (one that does not touch the boundary of the node) has
$3^d$ immediate neighboring cells in NC(C), counting C itself. For any point
in C, its Eps-neighboring points must be located within the cells in NC(C),
because each cell has side length Eps. Therefore, we only need to check the
$3^d$ cells in NC(C) in order to find all neighboring points of any point in
C. Note that NodeClustering needs to maintain the boundary points for future
cross boundary searching.
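
The cell-based neighborhood search described above can be sketched in Python as follows. This is a minimal in-memory sketch under our own assumptions, not the authors' C implementation: the names `build_cells` and `eps_neighbors` are hypothetical, cells are stored in a dictionary rather than a multi-dimensional array, and node boundaries are ignored.

```python
import math
from collections import defaultdict
from itertools import product

def build_cells(points, eps):
    """Map each point index to the grid cell (side length eps) that contains it."""
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(math.floor(x / eps) for x in p)].append(idx)
    return cells

def eps_neighbors(points, cells, idx, eps):
    """Find the Eps-neighborhood of points[idx] by scanning only the 3^d cells of NC(C)."""
    p = points[idx]
    home = tuple(math.floor(x / eps) for x in p)
    found = []
    # Offsets -1, 0, +1 in every dimension enumerate the 3^d neighboring cells.
    for offset in product((-1, 0, 1), repeat=len(p)):
        cell = tuple(c + o for c, o in zip(home, offset))
        for j in cells.get(cell, ()):
            if math.dist(p, points[j]) <= eps:
                found.append(j)
    return sorted(found)
```

Because two points within distance Eps differ by at most Eps in every coordinate, their cell coordinates differ by at most one, so no neighbor can lie outside the scanned cells.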

/* Input:  Node: a leaf node of the k-d tree; Node.D is the set of data points;
           Node.S is the subspace; and Node.CCO is the cross clustering order.
           Eps, MinPts: the clustering parameters.

   Output: RCS: the points whose clustering property has been determined;
           BPS[]: the boundary points on each split boundary */

/* prepare for clustering */

1) quantize Node.S by mapping all the points in Node.D into a cell array CS
   determined by Eps;

/* clustering */

2) for each non-empty cell c in CS do {

3)    for each point p in c do {

4)       calculate N_Eps(p) in NC(c);

5)       if |N_Eps(p)| ≥ MinPts then

6)          AssignClusterID(N_Eps(p));

7)    }

8) }

/* dispatch points */

9) for each point p in Node.D do {

10)    for each boundary b_i in Node.CCO = <b_1, ..., b_m> do

11)       if p is close to boundary b_i then

12)          { add p to BPS[b_i]; goto 9) };

13)    if p is not close to any boundary in Node.CCO then

14)       if p.CID ≠ UNCLASSIFIED then save p.CID on disk
             /* the cluster id of p has been identified */

15)       else report p as a noise;

16) }

Fig. 4. Procedure NodeClustering 

At step 5, if $|N_{Eps}(p)|$ is less than MinPts then, according to
Theorem 1, if $\exists q \in N_{Eps}(p)$ such that q is a core point, then p
and q belong to the same cluster; otherwise, p is a noise. In fact, if such
a q exists, q will be identified as a core point when it is processed at
step 5, and p will be assigned to the same cluster as q. If not, p will not
be assigned to any cluster, and will be maintained as a boundary point at
step 12 or reported as a noise at step 15.

Assign cluster id. At step 6, we need to assign a set of points P (the
points in the neighborhood of a core point) to a unique cluster, but some of
the points in the set may have been assigned different cluster ids already.
These points are in fact density-linked, even though their cluster ids are
different. Therefore, FDC needs to mark these different cluster ids as
equivalent. It uses a global object CidSet to generate and maintain cluster
ids. Any two cluster ids which are assigned to density-linked points are
marked equivalent in CidSet. Once all points are processed, all equivalent
cluster ids in CidSet can be identified. At the end, FDC scans the database
again and re-assigns unique cluster ids to all the points according to the
equivalent ids in CidSet (step 7 of FDC in Fig. 3).
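
One plausible way to realise the bookkeeping described for CidSet is a union-find (disjoint-set) structure. The paper does not say how CidSet is implemented, so the class below, its method names, and the helper `assign_cluster_id` are assumptions made purely for illustration.

```python
UNCLASSIFIED = -1

class CidSet:
    """Hypothetical stand-in for FDC's CidSet: issues ids and records equivalences."""

    def __init__(self):
        self.parent = []

    def new_id(self):
        self.parent.append(len(self.parent))
        return len(self.parent) - 1

    def find(self, cid):
        """Return the representative id of the equivalence class of cid."""
        while self.parent[cid] != cid:
            self.parent[cid] = self.parent[self.parent[cid]]  # path halving
            cid = self.parent[cid]
        return cid

    def mark_equivalent(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)

def assign_cluster_id(cid_of, neighborhood, cid_set):
    """Give a core point's neighborhood one id, merging any ids already present."""
    existing = [cid_of[p] for p in neighborhood if cid_of[p] != UNCLASSIFIED]
    cid = existing[0] if existing else cid_set.new_id()
    for other in existing[1:]:
        cid_set.mark_equivalent(cid, other)
    for p in neighborhood:
        if cid_of[p] == UNCLASSIFIED:
            cid_of[p] = cid
```

The final re-labelling pass then amounts to replacing each stored id `c` with `cid_set.find(c)`.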

Identify boundary points. At step 11 of procedure NodeClustering, we
need to determine the boundary points. There are two types of boundary
points. Given a split boundary $b_i$ of a node, the type one boundary points
are those whose distance from $b_i$ is less than Eps. For this type of
boundary point, we need to perform cross boundary searching to find points
on the other side of the boundary that fall in their Eps-neighborhoods.
Type two boundary points are those in the set
$\{p \mid p.CID = UNCLASSIFIED$, and $\exists q \in N_{Eps}(p)$, q is a type
one boundary point$\}$. These are points which have not been assigned to
any cluster but are close to a type one boundary point. If a type one
boundary point turns out to be a core point after a cross boundary
searching, then the type two boundary points close to it will be absorbed
into the same cluster.

We can identify type two boundary points by the following conditions: (1)
the distance from the point to the boundary is between Eps and 2Eps; (2) the
point is not a core point (otherwise, it would have been assigned to a
cluster); (3) none of the points in its Eps-neighborhood has been identified
as a core point. A type two boundary point will become a noise unless a
type one boundary point close to it becomes a core point in a cross
boundary searching.

A boundary point is dispatched to only one boundary for the purpose of cross
boundary searching. If a point is close to more than one boundary, the
boundary that appears first in the CCO of the node is selected for cross
boundary searching. For example, the CCO of the node $n_2$ in Fig. 2 is
$\langle U_y, L_x, U_x \rangle$. If a point in $n_2$ is close to both $L_x$
and $U_y$, it will be dispatched to $U_y$ for cross boundary searching.
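
The dispatch rule can be sketched as a small Python function. It covers only the type one test (distance to the boundary less than Eps); the function name, the `bounds` dictionary layout, and the coordinates in the usage note are hypothetical.

```python
def dispatch_boundary(point, bounds, cco, eps):
    """Return the first boundary in the CCO that `point` lies within eps of, else None.

    bounds maps a boundary name such as "Ux" to (axis index, coordinate);
    cco lists the boundary names in cross clustering order.
    """
    for name in cco:
        axis, coord = bounds[name]
        if abs(point[axis] - coord) < eps:
            return name
    return None
```

For a node with CCO `["Uy", "Lx", "Ux"]` and assumed boundaries `{"Uy": (1, 5.0), "Lx": (0, 4.0), "Ux": (0, 10.0)}`, a point at (4.3, 4.8) with Eps = 0.5 is close to both Lx and Uy and is dispatched to Uy, the earlier entry in the CCO.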

3.2 CrossClustering : cross boundary searching at an internal node 

The procedure CrossClustering is applied on internal nodes to compute the
Eps-neighborhoods of the boundary points found in their child nodes (step 5
of FDC in Fig. 3). It combines the type one boundary points on the two sides
of a split boundary from the child nodes. CrossClustering again uses the
cell-based method to identify the search area for the boundary points.







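[Figure 5: the k-d tree nodes involved in cross boundary searching: leaf
nodes n1, n2 and n3, and internal nodes Y: y3 and X: x2.]

Fig. 5. The procedure CrossClustering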



An example is presented in Fig. 5 to illustrate the procedure
CrossClustering. The procedure is first applied to the internal node
$Y: y_3$ in Fig. 2. The cross boundary searching is done between its two
child nodes $n_2$ and $n_3$. The cells containing boundary points at $L_y$
of $n_3$ and $U_y$ of $n_2$ are involved in the searching. At the next level
up, the procedure is applied to the internal node $X: x_2$; boundary points
at $U_x$ of $n_1$ and those at $L_x$ of $Y: y_3$ are involved in the cross
boundary searching at $X: x_2$. During the process, any type one boundary
point whose Eps-neighborhood has enough points is promoted to a core point.
In addition, all type two boundary points in its immediate neighboring cells
are assigned the same cluster id. At the end of the process, all remaining
unassigned boundary points become noise.

3.3 Complexity of FDC 

In this section we analyse the complexity of FDC. Since the k-d tree is very
common, we skip the analysis of the cost of building the tree. However, we
would like to note that it is cheaper to build a k-d tree than an R*-tree.
Also, according to our performance study, the time spent in building the
k-d tree is only a small part of the total cost of the clustering.

I/O cost: FDC needs to perform two passes on the k-d tree. The first is
to perform NodeClustering in the leaf nodes and CrossClustering in the
internal nodes. The second pass is to re-label all the points with their
unique cluster ids. Therefore, the I/O cost is on the order of scanning the
database twice.

CPU cost: For every point $p \in D$, no matter whether it is a boundary
point or a non-boundary point, we only need to compute its distances from
the points in its $3^d$ immediate neighboring cells. Since Eps would be
rather small, it is reasonable to assume that the number of points in each
cell is bounded by a constant. Therefore, the complexity of computing the
Eps-neighborhoods of all the points is on the order of $3^d \times N$, where
N is the number of points in the database. Since d is in general a small
integer, the time complexity of FDC is linear in the size of the database.



4 Performance Studies 

We have carried out extensive performance studies to investigate FDC on a
Sun Enterprise 4000 shared-memory computer. The machine has twelve 250 MHz
UltraSPARC processors and 1 GB of main memory, and runs Solaris 2.6. We use
both synthetic data and a real database in the experiments. Our goal is to
compare FDC with DBSCAN in various environments. FDC is implemented in C,
and the code of DBSCAN was provided by its author. We compare the
efficiency of the two algorithms with respect to the following parameters:

— database size 

— number of clusters 

— number of dimensions 

— percentage of noise points in the database 

The efficiency is measured by response time. Since building the R*-tree in
DBSCAN is more complicated than building the k-d tree in FDC, we have
excluded the time to build the trees from both algorithms in order to make
a fair comparison. Our results have shown that the cost of building the k-d
tree is only a small fraction of the total cost of FDC, whereas the cost of
building the R*-tree is a substantial part of the total cost of DBSCAN.

4.1 Synthetic data generation 

The synthetic data are generated by a 2-step procedure controlled by the
parameters listed in Table 1. In the first step, pre-defined (size and
shape) non-overlapping clusters are randomly generated in the data space. In
the second step, points are generated within each cluster with a uniform
distribution. The total number of points generated in the clusters is
$N \times (1 - p_s)$. An additional $N \times p_s$ points are randomly added
to the data space as noise points. In all the studies, unless explicitly
stated otherwise, the percentage of noise $p_s$ is set to 10%.
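
The second step of the generation procedure can be sketched in Python as follows. This is an illustrative sketch, not the authors' generator: we assume that clusters are axis-aligned boxes supplied by the caller (the paper allows pre-defined shapes and sizes), that clustered points are spread round-robin over the clusters, and the function name and signature are hypothetical.

```python
import random

def generate_points(n, noise_fraction, space, clusters, seed=0):
    """Generate n points: n * (1 - noise_fraction) inside clusters, the rest as noise.

    space and each cluster are axis-aligned boxes given as (lower_corner, upper_corner).
    """
    rng = random.Random(seed)

    def uniform_in(box):
        lo, hi = box
        return tuple(rng.uniform(a, b) for a, b in zip(lo, hi))

    n_noise = round(n * noise_fraction)
    n_clustered = n - n_noise
    # Clustered points first, assigned to the clusters in round-robin order.
    points = [uniform_in(clusters[i % len(clusters)]) for i in range(n_clustered)]
    # Noise points are drawn uniformly from the whole data space.
    points += [uniform_in(space) for _ in range(n_noise)]
    return points
```

With `n = 1000` and `noise_fraction = 0.1`, the first 900 points fall inside the cluster boxes and the last 100 are scattered over the whole space.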



Table 1. Data generation parameters

d          number of dimensions in the data space
L_i        length of each dimension of the data space, i = 1, ..., d
N_c        number of clusters
Shape_j    shape and size of each cluster, j = 1, ..., N_c
N          number of data points
p_s        percentage of noise points



4.2 Results on the synthetic data sets 

Database size: Our first experiment examines the effect of the database
size. We fix the size of a 3-dimensional data space and set the number of
clusters to 10. The number of data points varies from 0.5 million to 3
million. Fig. 6 shows the result: FDC is faster than DBSCAN by a factor of
between 55 and 70. As predicted, the response time of FDC is linear in the
database size. In contrast, the response time of DBSCAN increases much
faster than the database size. This shows that the linear complexity of FDC
makes it a more efficient algorithm than DBSCAN.




Fig. 6. Effect of $N$

Fig. 7. Effect of $N_c$

Number of clusters: Fig. 7 shows the result of varying the number of
clusters from 1 to 100. One million data points are generated in a
3-dimensional data space. The results show that FDC consistently
outperforms DBSCAN. As the number of clusters increases, the number of core
points decreases, and the total cost of searching decreases. Therefore, the
response times of both algorithms go down slightly when the number of
clusters increases.

Number of dimensions: To examine the scalability of FDC against the
number of dimensions, we vary the number of dimensions from 2 to 6. The
number of data points is fixed at 1 million, with 10 clusters. Fig. 8 shows
that, even though the response time of FDC does increase exponentially with
the number of dimensions, it increases much more slowly than that of DBSCAN.

Percentage of noise points: In this experiment, we study the effect of
the percentage of noise points. We vary the percentage from 1% to 25%. The
data space has 1 million points and 10 clusters. Fig. 9 again shows that FDC
is consistently better than DBSCAN. The response times of both algorithms
decrease when the percentage increases: as the number of noise points
increases, the number of data points in clusters decreases, and it is
cheaper to identify a noise point than a point in a cluster.

Fig. 8. Effect of $d$

Fig. 9. Effect of $p_s$



4.3 Real data results 

We have also studied the efficiency of FDC on a real data set, the SEQUOIA
2000 benchmark data [7]. This benchmark uses real data sets that are
representative of Earth Science tasks. We use the point data in the database
in our experiment. The performance studies on DBSCAN were also done on the
same benchmark [1]. We extract 20% to 100% of the points from the database
to compile a series of data sets of different sizes. The result of comparing
FDC with DBSCAN is shown in Table 2. It shows that the response time of FDC
increases linearly with the size of the database. FDC outperforms DBSCAN by
a factor of 42 to 50.



Table 2. Response times on real database (sec.)

Number of points    12511    25022    37533    50044    62555
FDC                  2.23     5.04     7.77    10.18    12.72
DBSCAN              94.09   217.57   348.81   476.07   632.21
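
The speedup factors reported for the SEQUOIA 2000 data can be recomputed directly from the response times in Table 2. The short Python check below uses only the table values; the variable names are ours.

```python
# Response times in seconds, copied from Table 2.
fdc = [2.23, 5.04, 7.77, 10.18, 12.72]
dbscan = [94.09, 217.57, 348.81, 476.07, 632.21]

# Speedup of FDC over DBSCAN at each database size.
speedups = [d / f for d, f in zip(dbscan, fdc)]
for s in speedups:
    print(f"{s:.1f}")
```

The ratios grow steadily with the number of points, from a little over 42 at the smallest size to just under 50 at the largest.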



In summary, the performance studies clearly demonstrate that FDC is
superior to DBSCAN. In particular, its response time increases linearly
with the size of the database, while that of DBSCAN increases much faster.



5 Discussion and Conclusion

We believe that clustering in large databases to identify interesting
patterns is an important and useful data mining activity. However,
conventional algorithms suffer from severe drawbacks when applied to large
databases. The density-based notion of cluster is useful in extracting
meaningful patterns in large databases. In particular, it can separate
noise from the meaningful clusters.

In this paper, we have made the following contributions: 

— We have defined a new density-based notion of cluster which is derived from 
an equivalence relationship. 

— We have developed a linear algorithm FDC for computing density-based 
clusters in large database. 

— We have performed an in-depth performance study on both synthetic and 
real data. The results clearly demonstrate that FDC is significantly more
efficient than the previously proposed algorithm DBSCAN.

Future research will consider the following issues. The effectiveness of the
algorithm is sensitive to the parameters Eps and MinPts, so a mechanism to
determine these parameters would be very useful. Enhancing the scalability
of FDC in high dimensional databases is also a challenging problem.

Acknowledgment: We would like to thank Martin Ester for providing the
code of DBSCAN and the SEQUOIA 2000 point data set.

Abstract. This paper presents a lazy model-based algorithm, named
DBPredictor, for on-line classification tasks. The algorithm uses a local
discretization process to avoid the need for a lengthy preprocessing stage.
Another advantage of this approach is the ability to implement the algorithm
with tightly-coupled SQL relational database queries. To test the
algorithm's performance in the presence of continuous attributes, an
empirical test is reported against both an eager model-based algorithm
(C4.5) and a lazy instance-based algorithm (k-NN).



1 Introduction 

The large number of structured observations now stored in relational
databases has created an opportunity for classification tasks that require a
prediction for only a single event. This type of prediction will be
referred to as on-line classification, to differentiate it from
classification tasks that allow for batch-style model induction. This paper
proposes an algorithm, called DBPredictor, for such on-line classification
tasks. The algorithm performs a top-down heuristic search through the
IF antecedent THEN consequent rule space of the specific event to be
classified. The two challenges addressed in this paper are the local
discretization of numerical attributes and a tightly-coupled implementation
against SQL-based databases.

Section 2 contrasts the lazy model-based approach to classification with
other well known approaches. Section 3 describes DBPredictor with a focus on
the support for numerical attributes and an SQL interface. Section 4
presents the results of an empirical investigation into DBPredictor's
accuracy with respect to the number of numerical attributes. Finally,
Section 5 concludes with a summary of the paper.

2 Previous Work 

An algorithm for on-line classification tasks must decide whether to use an
eager or lazy approach, and whether it should be model-based or
instance-based. Eager algorithms induce a complete classification structure
(classifier) before they can process any classification requests. Lazy
algorithms, on the other hand, start work immediately on classifying the
given event [3]. Model-based algorithms represent their result in a language
that is richer than the language used to describe the dataset, while
instance-based algorithms represent their result in the same language that
is used to describe the dataset [11].

Based on these descriptions, on-line classification tasks would likely
benefit from a lazy model-based algorithm. Such an algorithm would focus its
effort on classifying the particular event in question, and would also
return a rationale that may help the person interpret the validity of the
prediction. Two recent proposals that make use of a lazy model-based
approach are a former version of DBPredictor [8] and the LazyDT (Lazy
Decision Tree) algorithm [6]. The main difference between these algorithms
is the use of rules in one and the use of decision tree paths in the other.

These two algorithms, however, cannot be tightly coupled with an SQL-based
relational database [2, 7] because they require datasets to be both
discretized and stored in memory. The version of DBPredictor presented in
this paper addresses both of these issues and presents a tightly-coupled
implementation of the SQL Interface Protocol (SIP) proposal [2, 7]. Other
details of this algorithm are presented in [9].



3 Algorithm 

DBPredictor requires three input parameters: a partially instantiated event
e; the attribute $A_c$ whose value is to be predicted; and a dataset D from
the same domain as e. With this information, DBPredictor performs a search
through a constrained space of all the applicable
IF antecedent THEN consequent classification rules. The search starts from
a general seed rule that covers e. The algorithm then generates several
candidate rules that are more specialized than the previous rule. To
determine which rule to specialize further, the candidate rules are tested
with a heuristic function F(). The rule that achieves the highest value at
each specialization step is the one selected for the next round of
specialization. The search proceeds until a stopping criterion is
encountered.



3.1 top_down_search()

Once the seed rule $r_0$ has been composed, DBPredictor simply outputs the rule
returned by a call to top_down_search with parameters ($r_0$, ($D$, $e$, $A_c$)). This
procedure performs a greedy top-down search through a constrained rule space. Procedure 3.1
presents a pseudo-code overview of top_down_search(). Only two of its four
sub-procedures are described in detail in the coming sections: generate_antecedents()
and get_consequent(). The heuristic function F() can be any impurity measure.
Several heuristics, such as entropy and Euclidean distance, have been successfully tested
in [9]. Finally, best_rule() selects the rule with the highest estimated predictive
value.



3.2 generate_antecedents()

The generate_antecedents() procedure returns the set of rules to be tested by
the heuristic function. To constrain the search space from the $2^n$ possible attribute-value
combinations, the procedure returns only the (up to) $n - 1$ rules obtained by specializing
on each of the $n$ attributes.

Procedure 3.1 top_down_search()

Input: (r, P): rule r and algorithm parameters P := (D, e, A_c).

Output: A rule r' that covers e but which cannot be further specialized.

Method:

    R := generate_antecedents(r, P)
    for all rules r' in R do
        r'.consequent := get_consequent(r', P)
        r'.value := F(r', r)
    end for

    best_r' := best_rule(R)
    if (best_r' is not empty) then
        return(top_down_search(best_r', P))
    else
        return(r)
    end if
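
The greedy control flow of Procedure 3.1 can be illustrated with a minimal Python sketch. It is not the authors' ANSI-C implementation: the callables `generate_antecedents` and `heuristic` are hypothetical stand-ins for the sub-procedures, the recursion is written as a loop, and the toy rule (a single interval half-width around e = 6.5) and its scoring function are assumptions chosen only to make the example runnable.

```python
from typing import Callable, List, TypeVar

Rule = TypeVar("Rule")


def top_down_search(
    rule: Rule,
    generate_antecedents: Callable[[Rule], List[Rule]],
    heuristic: Callable[[Rule, Rule], float],
) -> Rule:
    """Greedily specialize `rule` until no candidate specialization remains."""
    while True:
        candidates = generate_antecedents(rule)
        if not candidates:
            # Stopping criterion: the rule cannot be specialized further.
            return rule
        # best_rule(): keep the candidate with the highest heuristic value.
        rule = max(candidates, key=lambda cand: heuristic(cand, rule))


# Toy setting (assumed): a rule is the half-width of an interval around E.
VALUES = [0.5, 3.0, 6.0, 6.4, 7.0, 9.0]
E = 6.5


def covered(delta: float) -> List[float]:
    """Records matched by the interval [E - delta, E + delta]."""
    return [v for v in VALUES if abs(v - E) <= delta]


def toy_generate(delta: float) -> List[float]:
    """Propose one narrower interval, unless it would cover fewer than 2 records."""
    child = delta / 1.5
    return [child] if len(covered(child)) >= 2 else []


def toy_heuristic(candidate: float, parent: float) -> float:
    """Placeholder score (not an impurity measure): prefer tighter coverage."""
    return -float(len(covered(candidate)))


final_delta = top_down_search(6.0, toy_generate, toy_heuristic)
```

In this toy run the interval is narrowed six times before a further step would leave only one covered record, at which point the last rule is returned.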



Symbolic Attributes The first time a specialization is attempted on a proposition that
refers to a symbolic attribute $A_i$, the proposition is simply updated from ($A_i$ = ANY)
to ($A_i = e_i$). If the proposition on this attribute has already been specialized, then no
further specialization is attempted.



Numerical Attributes The method described above for symbolic attributes cannot be
successfully applied to continuous attributes. If, for example, $e_2 = 6.5$ and the range
on attribute $A_2$ is [0.5, 9.0], then generating the proposition ($A_2 = 6.5$) would likely
result in a rule that covers few, if any, records in the dataset. One way to overcome this
situation is by discretizing all continuous attributes in the dataset before using the
classification algorithm. This approach could be thought of as eager discretization because
many of the regions that are made discrete are not necessary for the single classification
task at hand. DBPredictor instead uses a two-sided test ($A_i \in [e_i - \delta, e_i + \delta]$), where
$\delta > 0$, on continuous attributes. After a scan through the dataset to locate the [min, max]
range for each attribute, the $\delta$ for each proposition in the seed rule is set to the larger of
(max - $e_i$) and ($e_i$ - min). The proposition in the example above would be initialized
to ($A_2 \in [6.5 - 6, 6.5 + 6]$).

At each specialization DBPredictor makes $\delta'$ strictly smaller than the previous $\delta$
used by the parent's proposition. The decrease in $\delta$ between iterations is determined by
an internally set fraction named num_ratio (numerical partitioning ratio):

$$\delta' := \frac{\delta}{\text{num\_ratio}}, \qquad p_i := (A_i \in [e_i - \delta',\, e_i + \delta'])$$



An empirically determined default value for num_ratio is presented in Section 4. 
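
The interval arithmetic for numerical attributes can be sketched as follows. The function names are hypothetical, and the update rule (dividing the half-width by num_ratio) is an assumption consistent with the requirement that each new half-width be strictly smaller than its parent's; the default of 1.5 is the tuned value reported in Section 4.

```python
from typing import Tuple


def initial_delta(e_i: float, attr_min: float, attr_max: float) -> float:
    """Seed-rule half-width: the larger of (max - e_i) and (e_i - min)."""
    return max(attr_max - e_i, e_i - attr_min)


def specialize_delta(delta: float, num_ratio: float = 1.5) -> float:
    """Shrink the half-width; assumes the update is delta / num_ratio."""
    if num_ratio <= 1.0:
        raise ValueError("num_ratio must exceed 1 so that delta strictly shrinks")
    return delta / num_ratio


def interval(e_i: float, delta: float) -> Tuple[float, float]:
    """Bounds of the two-sided test A_i in [e_i - delta, e_i + delta]."""
    return (e_i - delta, e_i + delta)


# Worked example from the text: e_2 = 6.5 with attribute range [0.5, 9.0].
seed = initial_delta(6.5, 0.5, 9.0)      # larger of 2.5 and 6.0
child = specialize_delta(seed)           # first specialization
```

With these numbers the seed proposition spans [0.5, 12.5] and the first specialization narrows it to [2.5, 10.5].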

3.3 get_consequent()

Once a set of rule antecedents has been generated, the get_consequent() procedure
is used to construct each rule's consequent. For a given rule's antecedent (r.antecedent),
the procedure executes the following SQL query to return a summary of attribute $A_c$
for all the dataset records that match the antecedent: SELECT A_c, COUNT(*) FROM
D WHERE r.antecedent GROUP BY A_c. This implementation has the advantage that
it requires neither temporary tables nor the existence of a key attribute in
the dataset.
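
As an illustration of the query above, the following sketch runs it against an in-memory SQLite database using Python's standard sqlite3 module. The table name `D`, the column names, and the sample rows are invented for the example, and the helper is a stand-in rather than the paper's implementation.

```python
import sqlite3
from typing import Dict, Sequence


def get_consequent(
    conn: sqlite3.Connection,
    table: str,
    class_attr: str,
    antecedent_sql: str,
    params: Sequence = (),
) -> Dict[str, int]:
    """Return the class distribution of the records matching the antecedent.

    `table`, `class_attr` and `antecedent_sql` are interpolated into the query
    text, so they must come from trusted code; values go through `params`.
    """
    query = (
        f"SELECT {class_attr}, COUNT(*) FROM {table} "
        f"WHERE {antecedent_sql} GROUP BY {class_attr}"
    )
    return dict(conn.execute(query, params).fetchall())


# Hypothetical dataset: one symbolic attribute, one numerical, one class.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE D (a1 TEXT, a2 REAL, cls TEXT)")
conn.executemany(
    "INSERT INTO D VALUES (?, ?, ?)",
    [("x", 6.0, "yes"), ("x", 7.5, "no"), ("y", 6.4, "yes"), ("x", 1.0, "no")],
)

# Antecedent: (a1 = 'x') AND (a2 in [4.5, 8.5])
distribution = get_consequent(
    conn, "D", "cls", "a1 = ? AND a2 BETWEEN ? AND ?", ("x", 4.5, 8.5)
)
```

The single GROUP BY query returns one row per class value, which is all a rule's consequent needs; no temporary table or key column is involved.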



4 Empirical Results 

An empirical study was conducted to test DBPredictor's accuracy with respect to the
proportion of continuous attributes in the dataset.^1 Twenty-three datasets from the UCI
repository [10] were used in this study: anneal, heart-h, audiology, hepatitis*, breast-w,
horse-colic, chess, iris*, credit-a, letter, credit-g, liver-disease, diabetes, mushroom,
echocardiogram*, segment, glass, soybean-small, hayes-roth*, tic-tac-toe*, heart, vote,
and heart-c. An attempt was made to include previously studied datasets [6, 5, 12] with
a wide variety of sizes and proportions of continuous attributes.^2 The benchmark
algorithms for this study were the C4.5 r8 decision tree algorithm [11, 12] and the IB1
k-nearest neighbor algorithm [4]. Finally, each algorithm's true error rate on each dataset
was estimated with the use of multiple ten-fold stratified cross-validation tests.

4.1 Tuning num_ratio

The first portion of the empirical study determined an appropriate value for DBPredictor's
num_ratio internal parameter. All three algorithms were tuned on the five datasets
marked with a * beside their names. The value for num_ratio that achieved the lowest
average error rate on these five datasets was 1.5. This value was passed on to the next
study. Similarly, the IB1 algorithm's k was set to 5 after being tuned on the same five
datasets as DBPredictor.



4.2 Continuous Domains 

To test for bias with respect to the proportion of numerical attributes in a dataset, the
three pairwise combinations of DBPredictor, C4.5 and IB1 were contrasted. For
each pair, the datasets in which one algorithm performed significantly better^3 than the
other were identified. Table 1 summarizes the results of the three pairwise tests. Rather
than being skewed to a strong bias for or against continuous attributes, the results suggest
that DBPredictor's bias is instead situated between that of IB1 and C4.5.

^1 An ANSI-C implementation of the algorithm can be downloaded from the
www.cs.sfu.ca/~melli/DBPredictor Web site.

^2 The 23 datasets possessed on average 48% numerical attributes and ranged from 0% to 100%.

^3 Based on a two-tailed t-test with 99.5% confidence.

Table 1. Average percentage of numerical attributes in the datasets on which each algorithm
performed significantly more accurately than another algorithm. For example, the datasets on
which DBPredictor was more accurate than IB1 had 39% numerical attributes on average.

DBP/C4.5    DBP/IB1    C4.5/IB1
54%/55%     39%/52%    20%/57%



5 Conclusion 

This paper presents an algorithm, named DBPredictor, that is targeted to on-line classi- 
fication tasks. These tasks require the prediction of a single event’s class, based on the 
records stored in a relational database. DBPredictor uses a lazy model-based approach 
in that it performs only the work it requires to classify the single event. The opportu- 
nity presented in this paper is the use of proposition specialization. A tightly-coupled 
SQL based implementation of the algorithm is presented, along with the results of an 
empirical study into the relative bias for or against datasets with numerical attributes. 



References 

1. AAAI. Thirteenth National Conference on Artificial Intelligence. AAAI Press, 1996.

2. R. Agrawal and J. C. Shafer. Parallel mining of association rules: Design, implementation, 
and experience. IEEE Trans. Knowledge and Data Engineering, 8:962-969, 1996. 

3. D. W. Aha, editor. Lazy Learning. Kluwer Academic, May 1997. 

4. D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine 
Learning, 6(1):37-66, 1991.

5. P. Domingos. Unifying instance-based and rule-based induction. Machine Learning, 
24(2):141-168, August 1996. 

6. J. H. Friedman, R. Kohavi, and Y. Yun. Lazy decision trees. [1], pages 717-724. 

7. G. H. John and B. Lent. SIPping from the data firehose. In Proceedings, Third International 
Conference on Knowledge Discovery and Data Mining, pages 199-202. AAAI Press, 1997. 

8. G. Melli. Ad hoc attribute-value prediction. [1], page 1396.

9. G. Melli. Knowledge based on-line classification. Master’s thesis, Simon Fraser University, 
School of Computing Science, April 1998. 

10. P. M. Murphy and D. W. Aha. UCI repository of machine learning databases. Irvine, 
CA: University of California, Department of Information and Computer Science, 1995. 
ftp://ics.uci.edu/pub/machine-learning-databases.

11. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. 

12. J. R. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelli- 
gence Research, 4:77-90, March 1996. 





An Efficient Space-Partitioning Based Algorithm 
for the K-Means Clustering* 

Khaled AlSabti^1, Sanjay Ranka^2, and Vineet Singh^3

^1 Department of CS, King Saud University
alsabti@ccis.ksu.edu.sa
^2 Department of CISE, University of Florida
ranka@cise.ufl.edu
^3 Hitachi America Ltd.
vineet.singh@alumni.stanford.org



Abstract. k-means clustering is a popular clustering method. Its core
task of finding the closest prototype for every input pattern involves
expensive distance calculations. We present a novel algorithm for performing
this task. This and other optimizations are shown to significantly
improve the performance of the k-means algorithm. The resultant algorithm
produces the same (except for round-off errors) results as those of
the direct algorithm.



1 Introduction 

Clustering has been a widely studied problem in a variety of application domains 
including data mining. Several algorithms have been proposed in the literature. 
The partitioning based clustering algorithms partition a dataset into a set of k 
clusters. Each cluster is represented either by the center of gravity of the cluster 
(as in k-means) or by one of the objects of the cluster located near its cen- 
ter [2]. Two competing clustering algorithms are compared based on the quality 
of clustering achieved, and the time and space requirements for the computation. 
Depending on the application requirements, one or more of the above features 
may be relevant. 

The k-means clustering algorithm has been shown to be effective for many
practical applications. However, its direct implementation is computationally
very expensive, especially for large datasets. We propose a novel algorithm for
performing k-means clustering. The new algorithm uses a space-partitioning
based technique that deals with subsets of patterns collectively in determining
the cluster membership. Each subset of patterns represents a subspace in the
pattern space. The algorithm organizes the patterns in a partition-based structure
(e.g., a k-d tree). This structure can then be used to efficiently find all the patterns
that are closest to a given prototype. The new algorithm performs significantly
better than the direct algorithm. Further, it produces the same (except for round-off
errors due to limited precision arithmetic) clustering results. It is also competitive
with other state-of-the-art methods for improving the performance of k-means clustering.

* A large part of this work was done while the authors were at the Information Technology
Lab (ITL) of Hitachi America, Ltd.

2 The k-means Clustering Method 

Several forms of k-means have been studied in the literature [4, 2]. In this paper
we adapt Forgy's method [2]. k-means is an iterative technique which starts
with an initial clustering (e.g., generated randomly), and improves the quality 
of the clustering as it iterates. This process is halted when a stopping criterion 
is met; e.g., the cluster membership no longer changes or the quality of the 
clustering is not changed significantly. Clustering partitions the input patterns 
into k (user-defined) disjoint subsets. Each cluster is represented by a prototype 
pattern such as its centroid. The quality of the clustering is improved by updating 
the assignment of patterns to clusters (i.e., cluster membership). Each iteration 
of the direct algorithm consists of three steps: 

- Step 1: Updating the cluster membership to improve the overall quality. A
pattern is assigned to a cluster when its prototype is the nearest prototype
for the given pattern. This involves calculating the distance from every
pattern to every prototype. In this paper we use Euclidean distance as the
distance function. This step is the computationally intensive part of the
algorithm; it requires O(nkd) time, where n is the number of patterns, k is
the number of prototypes, and d is the dimensionality of the pattern.

- Step 2: Computing the new set of prototypes by finding the centroid of each
cluster; this takes O(nd) time.

- Step 3: Evaluating the stopping criterion. Using the standard mean squared
error for the stopping criterion requires O(nd) time.
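
The three steps of one iteration of the direct algorithm can be sketched in plain Python as follows. The function names are illustrative, and keeping the old prototype for an empty cluster is an assumption the paper does not address.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (same nearest-prototype ordering as Euclidean)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_iteration(
    patterns: List[Point], prototypes: List[Point]
) -> Tuple[List[int], List[List[float]], float]:
    """One iteration of direct k-means: returns (labels, new prototypes, MSE)."""
    k, d = len(prototypes), len(prototypes[0])

    # Step 1: assign every pattern to its nearest prototype, O(nkd).
    labels = [
        min(range(k), key=lambda j: sq_dist(p, prototypes[j])) for p in patterns
    ]

    # Step 2: recompute each prototype as the centroid of its cluster, O(nd).
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p, j in zip(patterns, labels):
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    new_prototypes = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(prototypes[j])
        for j in range(k)
    ]

    # Step 3: mean squared error used by the stopping criterion, O(nd).
    mse = sum(
        sq_dist(p, new_prototypes[j]) for p, j in zip(patterns, labels)
    ) / len(patterns)
    return labels, new_prototypes, mse


labels, protos, mse = kmeans_iteration(
    [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)],
    [(1.0, 1.0), (9.0, 1.0)],
)
```

Step 1 dominates the cost because it touches every pattern-prototype pair, which is the part the space-partitioning approach aims to avoid.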

The number of iterations required can vary from a few to several thousand,
depending on the problem instance. The appropriate choice of k is problem
and domain dependent. Generally, the user has to try several values [2].
Partitioning-based clustering algorithms, such as k-means, are generally very
sensitive to the initial clustering, and the user generally tries several
initializations [1]. Thus, a direct implementation of k-means can be computationally very
intensive, especially for applications with large datasets.

Two main approaches described in the literature can be used to reduce
the computational requirements of k-means: prototype-based and
membership-based. In the former, the prototypes are organized in a suitable structure so that
finding the closest prototype for a given pattern becomes more efficient [1]. This
approach is best used when the prototypes are fixed. However, in this paper
we assume that the prototypes will change dynamically. Hence, many of
these optimizations are not directly applicable. The latter technique uses the
information of the cluster membership from the previous iteration to reduce
the number of distance calculations. P-CLUSTER is a k-means algorithm which
exploits the fact that relatively few patterns change cluster membership after
the first few iterations [3]. It also uses the fact that the movement of the cluster
centroids is small for consecutive iterations (especially after a few iterations).
These optimizations are orthogonal to our approach (see Section 4).

3 The New Algorithm

Our algorithm partitions the pattern space into disjoint, smaller subspaces for
collectively finding the closest prototype for a subset of patterns representing
a subspace. The main intuition behind our approach is as follows. All the
prototypes are potential candidates for the closest prototype for all the patterns.
However, we may be able to prune the candidate set by: (1) partitioning the
space into a number of disjoint, smaller subspaces, and (2) using simple
geometrical constraints. Clearly, each subspace will potentially have a different candidate
set. Further, a prototype may belong to the candidate sets of several subspaces.
This approach can be applied recursively until the size of the candidate set is one
for each subspace. At this stage, all the patterns in the subspace have the sole
candidate as their closest prototype. As a result of using the above approach,
we expect to significantly reduce the computational requirements of the three
steps of the k-means algorithm. This is because the computation has to be
performed only with subspaces (representing many patterns) and not the patterns
themselves in most cases. The improvements obtained using our approach are
crucially dependent on obtaining a good pruning method.

In the first phase of the new algorithm, we build a k-d tree to organize the
patterns (detailed implementations are presented elsewhere [1]). The root of
such a tree represents all the patterns, while the children of the root represent
subsets of the patterns completely contained in smaller subspaces. For each node
of the tree, we keep the number of patterns, the linear sum of the patterns, and
the square sum of the patterns. In the second phase, the initial prototypes are
derived (as in the direct algorithm). In the third phase, the algorithm performs
a number of iterations (as in the direct algorithm) until a stopping criterion is
met. For each cluster i, we maintain the number of patterns, the linear sum of
the patterns, and the square sum of the patterns. In each iteration, we start
from the root node with all k candidate prototypes. At each node, we apply a
pruning function (described below) on the candidate prototypes. If the number of
candidate prototypes is equal to one, the traversal below that internal node is not
pursued. All the patterns belonging to this node have the surviving candidate as
the closest prototype. The pattern-to-cluster assignment can be guaranteed to be
the same as that of the direct algorithm by avoiding overpruning of the candidate
prototypes. The cluster statistics are updated based on the information stored
in that internal node.
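
The role of the per-node statistics can be illustrated with a short sketch. The class name `Stats` and the choice to store the square sum as a single scalar (the sum of squared norms) are assumptions; the identity used for the error, SSE = SS - ||LS||^2 / n, is the standard one for a cluster with count n, linear sum LS and square sum SS.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Stats:
    """Sufficient statistics kept for a tree node or for a cluster."""

    count: int
    linear_sum: List[float]
    square_sum: float  # sum of squared norms of the patterns (assumed scalar)

    @classmethod
    def from_points(cls, points: Sequence[Sequence[float]]) -> "Stats":
        dim = len(points[0])
        linear = [sum(p[t] for p in points) for t in range(dim)]
        square = sum(x * x for p in points for x in p)
        return cls(len(points), linear, square)

    def merged(self, other: "Stats") -> "Stats":
        """Fold a whole node into a cluster without visiting its patterns."""
        return Stats(
            self.count + other.count,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.count for s in self.linear_sum]

    def sse(self) -> float:
        """Sum of squared distances of the patterns to their centroid."""
        return self.square_sum - sum(s * s for s in self.linear_sum) / self.count


# Two tree nodes whose patterns all share the same closest prototype.
node_a = Stats.from_points([(0.0, 0.0), (2.0, 0.0)])
node_b = Stats.from_points([(2.0, 2.0)])
cluster = node_a.merged(node_b)
```

Because the three quantities are additive, a node with a single surviving candidate updates the cluster in time proportional to the dimensionality rather than to the number of patterns it holds.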

One iteration of the direct algorithm (or any other algorithm) is applied 
on a leaf node with more than one candidate prototype. This process uses the 
candidate prototypes and the patterns of the leaf node. At the end of each 
iteration, the new set of prototypes is derived and the error function is computed. 
These computations can efficiently be performed while deriving the exact (or 
similar due to the round-off error) results as those of the direct algorithm [1]. 

The cost of the pruning is the extra overhead in the above process. However, 
effective pruning may decrease the overall number of the distance calculations 
which decreases the overall time requirement. Clearly, there is a trade-off between 
the cost of the pruning technique, and its effectiveness. We used the following 
pruning strategy at each internal node of the tree: 

- For each candidate prototype, find the minimum and maximum distances to
any point in the subspace

- Find the minimum of the maximum distances; call it MinMax

- Prune out all candidates with minimum distance greater than MinMax

The above strategy guarantees that no candidate is pruned if it can potentially
be closer than any other candidate prototype to a given subspace. Thus, our
algorithm produces the same (except for round-off errors) clustering results as
the direct algorithm. Our pruning approach is conservative and
may miss some of the pruning opportunities. However, it is relatively
inexpensive, and can be shown to require constant time per candidate prototype [1].
Choosing a more expensive pruning algorithm may decrease the overall number
of distance calculations. This may, however, be at the expense of higher overall
computation time due to an offsetting increase in the cost of pruning.
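
The MinMax pruning rule can be sketched as below, assuming each subspace is an axis-aligned box given by its lower and upper corners (as is the case for k-d tree nodes). The function names are illustrative, and squared distances are used because they preserve the comparisons.

```python
from typing import List, Sequence

Point = Sequence[float]


def min_sq_dist_to_box(p: Point, lo: Point, hi: Point) -> float:
    """Squared distance from p to the nearest point of the box [lo, hi]."""
    return sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(p, lo, hi))


def max_sq_dist_to_box(p: Point, lo: Point, hi: Point) -> float:
    """Squared distance from p to the farthest corner of the box [lo, hi]."""
    return sum(max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(p, lo, hi))


def prune(candidates: List[Point], lo: Point, hi: Point) -> List[Point]:
    """Drop every candidate whose minimum distance exceeds MinMax."""
    min_max = min(max_sq_dist_to_box(c, lo, hi) for c in candidates)
    return [c for c in candidates if min_sq_dist_to_box(c, lo, hi) <= min_max]


# Unit square with one prototype inside it and two farther away.
survivors = prune([(0.5, 0.5), (3.0, 0.5), (10.0, 10.0)], (0.0, 0.0), (1.0, 1.0))

# Two prototypes on opposite edges: neither can be ruled out.
undecided = prune([(0.0, 0.5), (1.0, 0.5)], (0.0, 0.0), (1.0, 1.0))
```

In the first call the central prototype is at most sqrt(0.5) from any point of the square, while the others are at least 2 away, so only it survives. Each candidate costs time proportional to the dimensionality, independent of how many patterns the box contains.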

4 Experimental Results 

Since the new algorithm produces the same clustering results (except for
round-off errors) as the direct algorithm, we have measured
the performance of the new algorithm in terms of the distance calculations and
the total execution time. We also compare our algorithm with P-CLUSTER. We
have designed a hybrid algorithm in which the new algorithm is applied for the
top levels of the tree, and P-CLUSTER is applied to the leaf nodes. All the
experiments were conducted on an IBM RS/6000 running AIX version 4, with a
clock speed of 66 MHz and a memory size of 128 MByte. We present
representative results. For more details, the reader is referred to [1]. Our performance
evaluation of the algorithms is based on the following measures:

- FRD: the factor of reduction in distance calculations with respect to those
of the direct algorithm

- ADC: the average number of distance calculations performed per pattern
in each iteration

- FRT: the factor of reduction in overall execution time with respect to that
of the direct algorithm

FRD and ADC reflect the intrinsic quality of the algorithm and are relatively
architecture and platform independent.



Table 1. Description of the datasets; range along each dimension is the same
unless explicitly stated

Dataset | Size    | Dimensionality | No. of Clusters | Characteristic | Range
DS1     | 100,000 | 2              | 100             | Grid           | [-3, 41]
DS2     | 100,000 | 2              | 100             | Sine           | [2, 632], [-29, 29]
DS3     | 100,000 | 2              | 100             | Random         | [-3, 109], [-15, 111]
R4      | 256k    | 2              | 128             | Random         | [0, 1]
R8      | 256k    | 4              | 128             | Random         | [0, 1]
R12     | 256k    | 6              | 128             | Random         | [0, 1]

We used several synthetic datasets (see Table 1). The three datasets DS1, DS2
and DS3 are described in [5]. For the datasets R1 through R12, we generated k
(16 or 128) points randomly in a cube of appropriate dimensionality. For the ith
point we generated points around it, using uniform distribution. These
resulted in clusters with a non-uniform number of points [1].

Table 2 and other results (in [1]) show that our algorithm can improve the
overall performance of k-means by one to two orders of magnitude. The
average number of distance calculations required is very small and can vary
anywhere from 0.46 to 11.10, depending on the dataset and the number of clusters
required.



Table 2. The overall results for 10 iterations

Dataset | k  | Direct Algorithm: Time (s) | Our Algorithm: Total Time (s) | FRT   | FRD    | ADC
DS1     | 64 | 23.020                     | 2.240                         | 10.27 | 54.72  | 1.19
DS2     | 64 | 22.880                     | 2.330                         | 9.81  | 43.25  | 1.50
DS3     | 64 | 23.180                     | 2.340                         | 9.90  | 52.90  | 1.23
R4      | 64 | 64.730                     | 3.090                         | 20.94 | 139.80 | 0.46
R8      | 64 | 89.750                     | 8.810                         | 10.18 | 24.15  | 2.69
R12     | 64 | 117.060                    | 29.180                        | 4.01  | 5.85   | 11.10



Table 3 compares our algorithm to P-CLUSTER and the hybrid algorithm.
We can draw the following conclusions from this table and other results (in [1]):
(1) The performance of our algorithm is better than that of P-CLUSTER in almost all
cases for a small number of iterations. For a larger number of iterations and
higher-dimensional data, the performance of P-CLUSTER is better. This behavior is
partially attributed to the heuristics used by P-CLUSTER, which are more effective
after the first few iterations. The performance improvements of our algorithm do
not change significantly for a larger number of iterations, since our pruning
strategy does not optimize across iterations. (2) The hybrid algorithm captures
the best qualities of both algorithms. It outperforms or is comparable to the
two algorithms.



Table 3. The ADC produced by the three algorithms for 10 iterations

        | k = 16                        | k = 64                        | k = 128
Dataset | P-CLUSTER | Our Alg. | Hybrid | P-CLUSTER | Our Alg. | Hybrid | P-CLUSTER | Our Alg. | Hybrid
DS1     | 4.66      | 0.51     | 0.51   | 6.59      | 0.96     | 0.99   | 7.80      | 1.39     | 1.49
DS2     | 3.51      | 0.25     | 0.25   | 5.55      | 0.64     | 0.66   | 8.28      | 1.48     | 1.56
DS3     | 4.02      | 0.41     | 0.42   | 7.83      | 1.08     | 1.12   | 7.99      | 1.87     | 1.98
R4      | 5.39      | 0.32     | 0.32   | 4.04      | 0.46     | 0.49   | 5.78      | 1.27     | 1.22
R8      | 5.73      | 1.87     | 2.13   | 7.62      | 2.68     | 3.19   | 8.57      | 3.92     | 3.85
R12     | 7.12      | 7.68     | 6.58   | 10.70     | 11.10    | 8.63   | 12.00     | 15.92    | 9.56



5 Conclusion 

We have presented a novel technique for improving the performance of the
k-means algorithm. It can improve the performance of the direct algorithm by
one to two orders of magnitude, while producing the same (except for round-off
errors) results. Further, the new algorithm is very competitive with P-CLUSTER.
The performance of our algorithm is better than that of P-CLUSTER in almost all
cases for a small number of iterations. For a larger number of iterations, we have
developed and presented a hybrid algorithm (using P-CLUSTER). This hybrid
algorithm is empirically shown to require fewer (or a comparable number of) distance
calculations compared to P-CLUSTER.

Abstract. Identifying outliers and remainder clusters, which designate the few
patterns that differ greatly from the other clusters, is a fundamental step in
many application domains. However, current outlier diagnostics are often
inadequate for large amounts of data. In this paper, we propose a two-phase
clustering algorithm for outliers. In Phase 1 we modify the k-means algorithm
by using the heuristic "if one new input pattern is far enough away from all
clusters' centers, then assign it as a new cluster center". As a result, the number
of clusters found in this phase is larger than that originally set in the k-means
algorithm. We then propose a clusters-merging process in the second phase to
merge the resulting clusters obtained in Phase 1 into the same number of clusters
originally set by the user. The results of three experiments show that the
outliers or remainder clusters can be easily identified by our method.



1 Introduction 

Cluster analysis can be defined as the process of separating a set of patterns
(objects) into clusters such that members of one cluster are similar. The term
"remainder cluster" [2] is used to designate the few patterns that differ greatly
from the other clusters. They are often regarded as noise, like outliers. Therefore,
in most traditional clustering algorithms, such patterns are either neglected or
given a lower weight to keep them from being clustered.

However, in some specific applications we have to find the abnormal patterns
in a large amount of data. For example, in the medical domain, we may
want to find extraordinary cases in patient data; in the network domain,
we may want to find abnormal behaviors in log data. These problems
are not easily solved by traditional clustering algorithms, because the data
often contain more than ten thousand records. Using a traditional clustering
algorithm to cluster the minor and abnormal patterns will therefore either cost
much time or not work well. Although data mining approaches can find potential
relations in large amounts of data, few are designed for this goal.

In this paper, a two-phase clustering algorithm for outliers is proposed. In
the first phase, we propose a modified k-means algorithm that uses the heuristic
"if one new input pattern is far enough away from all clusters' centers, then
assign it as a new cluster center". As a result, the number of clusters found in
this phase is larger than that originally set in the k-means algorithm. We then
propose a clusters-merging algorithm in the second phase to merge the resulting
clusters obtained in Phase 1 into the same number of clusters as that originally
set by the user.

The merging algorithm first constructs a minimum spanning tree for these
clusters and sets it as a member of a forest. It then removes the longest edge of
a tree from the forest and replaces the tree with the two newly generated subtrees.
It repeatedly removes the longest edge from the forest until the forest contains
the desired number of trees. Finally, the smallest clusters are selected and regarded as
outliers, based upon our heuristic "the smaller the cluster is, the more likely it is
to be an outlier".

The structure of this paper is as follows. Section 2 presents related work on
outlier detection and clustering methods. Section 3 gives a detailed description of
our algorithm for finding outlier data. Finally, Section 4 presents experimental
results on different data to show the performance of our algorithm, together with
our conclusion.



2 Related Work 

For the selection of variables and identification of outliers, several robust algo- 
rithms have been proposed for detecting multiple outliers [3] [4] which can pro- 
duce estimates by giving up the outliers for the purpose of inference. Among the 
methods suggested for detecting outliers, three popular models will be briefly 
introduced as follows. 



- Bayesian model [5] [8]: Bayesian model averaging provides optimal predictive
ability. Its discriminant analysis compares certain linear combinations
of posterior densities of the feature vector with respect to the classes considered.
However, this approach will not be practical in many applications due to the
large number of models for which posteriors need to be computed.

- Hierarchical clustering model [6]: In the hierarchical clustering model, single
linkage merges clusters based on the distance between the two closest
observations in each cluster, and the results of the clustering can be seen as a "cluster
tree". The clusters which are added later can be seen as outliers or remainder
clusters. However, since the time complexity is $O(n^3)$, it is not suitable for
large amounts of data.

- Nearest neighbor model [1]: The nearest neighbor model, like hierarchical
classification, is a simpler pattern classification method. Its computational
expense requires a reduction in the number of reference patterns. The
algorithm is very simple, with satisfactorily high learning speed, but its
convergence is only local; hence a precise arrangement of the initial positions of
the prototypes is needed for its convergence.

In data-mining applications where the data tend to be very large, e.g., 100
thousand records, an important factor is the computing resources required (CPU
time and memory space). Resource consumption varies considerably between the
different methods, and some methods become impractical for all but small data
sets; e.g., hierarchical clustering methods are in $O(n^2)$ for memory space and
$O(n^3)$ for CPU time [7]. In contrast, our modified k-means algorithm generally
has a time and space complexity of order n, where n is the number of records in
the data set.

3 Our Method 

We first use the modified k-means algorithm to cluster a large quantity of data in a
coarse manner, allowing the number of clusters to exceed k. The noise
data are then more likely to be placed in an independent cluster
than with a traditional algorithm. In the second phase, a minimum spanning tree
is built according to the distance of edges, where each node of the tree represents
the center of a cluster obtained in Phase 1. Then we remove the longest k-1
edges; the remaining k clusters are not as evenly partitioned as those produced by
other algorithms. The algorithm of our clustering method is divided into two phases:

- Phase 1. Modified k-means process
- Phase 2. Clusters-merging process

3.1 Modified k-means (MKM) 

We use a heuristic of "jump", which splits an outlier off as another cluster
center during the clustering iterations. This heuristic permits the
number of clusters to be adjusted. As in the ISODATA algorithm [9], adding more clusters helps
us to find potential outliers or remainder clusters that are not conspicuous.
Let k' be the number of adjustable clusters. Initially, set k' = k and randomly
choose k' patterns as cluster centers, $C = \{z_1, z_2, \ldots, z_{k'}\}$ in $R^n$. During
iteration, when adding one pattern to its nearest cluster, we compute not only
the new cluster's center but also the minimum distance min(C) between any two
clusters' centers. So, for any pattern $x_i$, the minimum distance min($x_i$, C) to its
nearest cluster's center is computed by

$$\min(x_i, C) = \min_{j} \|x_i - z_j\|^2, \quad j = 1, \ldots, k', \qquad (1)$$

$$\min(C) = \min_{j \neq h} \|z_j - z_h\|^2, \quad j, h = 1, \ldots, k'. \qquad (2)$$

According to the above heuristic, more clusters can be found after the data
are split. In the extreme case, each pattern is put into its own cluster, i.e.,
k' = m, where m is the number of patterns. This results in too many clusters,
so we define kmax as the maximum number of clusters we can afford. In the
modified k-means algorithm, when the patterns are split into kmax + 1 clusters,
the two closest clusters are merged into one. The modified k-means algorithm
is as follows:

Step 1: Randomly choose k' (k' = k) initial seeds as cluster centers.

Step 2: For i <- 1 to m do:
        If min(x_i, C) > min(C) then go to Step 3; else go to Step 5.

Step 3: Assign x_i to be the center of a new cluster C_k' and set k' <- k' + 1.

Step 4: If k' > kmax, merge the two closest clusters into one and set k' = kmax.

Step 5: Assign x_i to its nearest cluster.

Step 6: Go to Step 2 until the cluster membership stabilizes.
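
The six steps above can be sketched in Python as follows. This is only an
illustrative sketch under several assumptions that the text does not fix:
patterns are tuples of numbers, centers are recomputed as centroids once per
pass rather than after every single assignment, two merged clusters get the
member-weighted mean of their centers, and k <= kmax. The names
modified_kmeans, closest_pair and nearest are hypothetical.

```python
import itertools
import random


def sq_dist(a, b):
    """Squared Euclidean distance, as used in Eqs. (1) and (2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x, centers):
    """Index of the center closest to pattern x."""
    return min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))


def closest_pair(centers):
    """Indices (a, b), a < b, of the two closest centers."""
    pairs = itertools.combinations(range(len(centers)), 2)
    return min(pairs, key=lambda p: sq_dist(centers[p[0]], centers[p[1]]))


def modified_kmeans(patterns, k, k_max, max_iter=50, seed=0):
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(patterns, k)]        # Step 1
    previous, clusters = None, []
    for _ in range(max_iter):
        members = [[] for _ in centers]
        for i, x in enumerate(patterns):                          # Step 2
            if len(centers) > 1:
                a, b = closest_pair(centers)
                gap = sq_dist(centers[a], centers[b])             # min(C)
                if sq_dist(x, centers[nearest(x, centers)]) > gap:
                    centers.append(tuple(x))                      # Step 3
                    members.append([])
                    if len(centers) > k_max:                      # Step 4
                        a, b = closest_pair(centers)
                        wa = max(len(members[a]), 1)
                        wb = max(len(members[b]), 1)
                        # Assumption: merged center = weighted mean.
                        centers[a] = tuple(
                            (wa * p + wb * q) / (wa + wb)
                            for p, q in zip(centers[a], centers[b]))
                        members[a] += members[b]
                        del centers[b], members[b]
            members[nearest(x, centers)].append(i)                # Step 5
        clusters = [sorted(m) for m in members if m]
        centers = [tuple(sum(c) / len(m)
                         for c in zip(*(patterns[i] for i in m)))
                   for m in clusters]
        if sorted(clusters) == previous:                          # Step 6
            break
        previous = sorted(clusters)
    return centers, clusters
```

With two tight groups of points and one distant point, the distant point ends
up in a cluster of its own, which is the behaviour the "jump" heuristic is
meant to produce.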

In Steps 2 to 4, we compare "the minimum distance between a pattern and its
nearest cluster" with "the minimum distance among clusters". We use the
heuristic "if a pattern is far enough away, make it a new cluster". Note that
the splitting and merging operations employed here are quite similar to those
proposed in ISODATA [9]. However, ISODATA requires the user to set the
parameters that determine splitting and merging, so it is not suitable when
the user knows neither the properties nor the scatter of the patterns. In our
scheme, splitting is triggered by the current state of the clusters, and
merging is performed when too many clusters occur. It is therefore not
necessary to know the distribution of the data thoroughly before clustering.



3.2 Clusters-Merging Process 

Most clustering techniques use density as the criterion for judging whether a
clustering result is good. In our method, however, there is no need to split a
large cluster into several partitions to reduce the density. Like the criteria
of hierarchical techniques, our criterion for identifying outlier data depends
on distances. In Phase 2 we propose a clusters-merging process to merge the
clusters obtained in Phase 1. We first construct a minimum spanning tree for
these clusters and make it a member of a forest F. We then remove the longest
edge of a tree in the forest and replace that tree with the two newly generated
subtrees, and we repeat this removal of the longest edge in the forest until
the number of trees in F equals k.

Finally, we select the smallest clusters from F, which may be regarded as
outliers according to our heuristic "the smaller the cluster, the more likely
it is to be an outlier". Note that the decision to regard a cluster as outliers
can be further improved with the help of domain experts. The algorithm is as
follows:

Step 1: Construct an MST from the centroids of set C and add it to F.

Step 2: For i <- 1 to k-1 do:
        Find the longest edge among all the edges in F and remove it, splitting
        the tree that contains it.

Step 3: Select the smallest clusters from F, which can be regarded as outliers.
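
The merging phase can be sketched as below. The sketch assumes Euclidean
distance between centroids, builds the MST with a simple Prim-style loop, and
uses the fact that removing the longest edge of the forest k-1 times is the same
as dropping the k-1 longest MST edges. The function names are hypothetical, and
the groups are returned smallest first so that outlier candidates come first.

```python
import math


def mst_edges(centers):
    """Prim-style MST over cluster centers; returns (length, u, v) edges."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(centers):
        w, u, v = min((math.dist(centers[u], centers[v]), u, v)
                      for u in in_tree
                      for v in range(len(centers)) if v not in in_tree)
        edges.append((w, u, v))
        in_tree.add(v)
    return edges


def merge_clusters(centers, k):
    """Cut the k-1 longest MST edges and return the k groups of center indices."""
    edges = sorted(mst_edges(centers))
    kept = edges[:len(edges) - (k - 1)]          # drop the k-1 longest edges
    parent = list(range(len(centers)))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for _, u, v in kept:
        parent[find(u)] = find(v)
    groups = {}
    for i in range(len(centers)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len)      # smallest groups first
```

For example, with three centers near the origin, two near (10, 10) and one at
(50, 50), calling merge_clusters with k = 3 isolates the center at (50, 50) as
the smallest group.
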
4 Experiments and Conclusion

To compare the clustering performance of our outlier clustering process with
that of the k-means algorithm, two experiments were carried out, on the Iris
data and on an e-mail log. The Iris data set is a well-known benchmark that
contains some noise.




(a) (a') (b) (b') 



Fig. 1. The result of the first experiment. Figs. 1(a) and 1(b) are obtained by
the k-means algorithm, while Figs. 1(a') and 1(b') are obtained by our outlier
clustering process.



In Figs. 1(a) and 1(a') we choose the 1st and 2nd attributes of the Iris data
as the two input parameters. Fig. 1(a) shows the result of the k-means
algorithm, while Fig. 1(a') shows the result of our process; the same holds for
Figs. 1(b) and 1(b'). In the e-mail log, 25511 entries are analyzed. We find
that our process clearly clusters outliers as independent clusters. The
outliers or remainder clusters not only can be easily identified but also can
be computed more quickly.

Abstract. The study develops a new method for finding the optimal
non-Euclidean distance metric in the nearest neighbour algorithm. The 
data used to develop this method is a real world doctor shopper classifica- 
tion problem. A statistical measure derived from Shannon’s information 
theory - known as mutual information - is used to weight attributes in 
the distance metric. This weighted distance metric produced a much bet- 
ter agreement rate on a five-class classification task than the Euclidean 
distance metric (63% versus 51%). The agreement rate increased to 77% 
and 73% respectively when a genetic algorithm and simulated annealing 
were used to further optimise the weights. This excellent performance 
paves the way for the development of a highly accurate system for de- 
tecting high risk doctor-shoppers both automatically and efficiently. 



1 Background 

Established in 1974, the Health Insurance Commission of Australia (HIC)
administers Australia’s universal health insurance programs. The HIC has a
fiduciary responsibility to protect the public purse and to ensure that
taxpayers’ funds are spent wisely on health care. A major problem in recent
times is the large number of patients who consult many different doctors
(‘doctor-shop’) in order to obtain high volumes of prescription drugs. These
patients are at risk of obtaining volumes of medicine far in excess of their
own current therapeutic need, either for sale or to support their own drug
habit.

2 Methodology 

Three general rules were used to screen the entire Australian population of 18
million people in order to identify patients who may be doctor-shoppers. Of the
13,000 people identified as possible doctor-shoppers using these rules, a random
sample of 1,199 people was carefully scrutinised by an expert clinical
pharmacologist. Using 15 features to profile each patient’s behaviour over a
year-long period, the expert classified these patients on a 5-point scale
according to the likelihood that they were doctor-shoppers. This scale ranged
from class 1 (patients most likely to be doctor-shoppers) to class 5 (patients
least likely to be doctor-shoppers). This classified sample of 1,199 patients
was then divided randomly into two groups: a training set of 800 patients and a
test set of 399 patients. The training set was used to set the parameters of a
nearest neighbour (NN) algorithm. The performance of the NN algorithm was then
measured on the test set. The aim of the present study was to achieve the best
possible agreement rate on the test set. Once this is achieved, the NN
algorithm can be applied to the unclassified sample of 13,000 patients who
were identified as possible doctor-shoppers using simple rules.

3 Nearest Neighbour (NN) algorithm 

The NN algorithm (Dasarathy 1991) is a class of learning algorithms that uses
a simple form of table look-up to classify examples. In the nearest neighbour
algorithm used in the current study, a new case is assigned the classification
of its nearest neighbour. The distance (squared) between $x_{j_1}$ (a new case)
and $x_{j_2}$ (a known case) is defined by equation (1):

$$d(x_{j_1}, x_{j_2}) = \sum_{i=1}^{k} w_i \left( x_{j_1}^{i} - x_{j_2}^{i} \right)^2, \qquad (1)$$

where $x_{j}^{i}$ is the ith feature of case $x_j$, $w_i$ is the weight
assigned to the ith feature, and k is the number of features. The following
approach was used to evaluate the performance of various methods for
optimising the feature weights, $w_i$. First, each case in the training set is
assigned a ‘predicted classification’ using the ‘leave-one-out’ method. Weights
in the NN distance metric are then set in order to minimise the number of
misclassifications in the training set. The 399 cases in the test set are then
assigned the ‘predicted classifications’ of their nearest neighbour in the
training set. The agreement rate is calculated from the confusion matrices in
the training and test set. The agreement rate represents the percentage of
patients in all five classes who are correctly classified by the NN algorithm.
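
The leave-one-out evaluation described above can be illustrated with a small
sketch. It assumes a weighted squared distance in which each squared feature
difference is multiplied by its weight, and it reports the agreement rate as a
fraction rather than a percentage. The function names and the toy data are
hypothetical and are not taken from the study.

```python
def weighted_sq_dist(a, b, w):
    """Weighted squared distance between two cases."""
    return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b))


def nn_predict(x, train_X, train_y, w, skip=None):
    """Class of the nearest training case, optionally leaving one case out."""
    best = min((i for i in range(len(train_X)) if i != skip),
               key=lambda i: weighted_sq_dist(x, train_X[i], w))
    return train_y[best]


def loo_agreement(train_X, train_y, w):
    """Fraction of training cases whose leave-one-out prediction is correct."""
    hits = sum(nn_predict(x, train_X, train_y, w, skip=i) == train_y[i]
               for i, x in enumerate(train_X))
    return hits / len(train_X)


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [(0.0, 0.0), (0.1, 5.0), (1.0, 0.1), (1.1, 5.1)]
y = [1, 1, 2, 2]
unweighted = loo_agreement(X, y, (1.0, 1.0))   # noise dominates the distance
weighted = loo_agreement(X, y, (1.0, 0.0))     # noise feature switched off
```

On this toy data the unweighted metric misclassifies every case, while giving
the noisy feature zero weight classifies every case correctly, which is the
effect the weighting is intended to achieve.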

4 Mutual information 

Using this approach, the weights $w_i$ (i = 1, 2, 3, ...) are assigned the
value of the mutual information between each feature in a patient’s profile and
the classification of the patient. The mutual information of two variables
(Linsker 1990; Shannon and Weaver 1949) is a measure of the common information
or entropy that is shared between the two variables. In the present study, the
mutual information between the patient’s classification and a particular
feature is used to judge if the feature could, of itself, contribute usefully
to the classification scheme. Setting weights to the mutual information value
of their associated feature provides a heuristic way of heavily weighting the
most discriminatory features. The mutual information between two variables, a
feature x and the patient classification y, is defined in equation (2):

$$I(x, y) = \int \sum_{i=1}^{n} f(x, y_i) \log \frac{f(x \mid y_i)}{f(x)} \, dx, \qquad (2)$$

where $f(x, y_i)$ is the joint probability density function of x and y being in
class i; $f(x \mid y_i)$ is the conditional probability density function of x
given y being in class i; and n is the number of classes. In our study, class y
is a categorical variable taking five discrete values ranging from class 1
through to class 5. Since the mutual information is defined on the basis of the
prior and conditional probability distribution of random variables, accurate
estimates can only be obtained using very large samples. Although such large
samples are rarely available in real-world applications, approximate estimates
can be obtained from the small number of pre-classified profiles that are
usually available. In the present study, the mutual information value of each
of the 15 patient features was estimated from the 800 samples in the training
set.
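
One simple way to estimate such a value from a small labelled sample is
sketched below. It assumes that the feature has already been discretised into
bins, so that the integral in equation (2) becomes a sum over bins, and it uses
base-2 logarithms. Both choices are assumptions of this sketch, since the
estimator actually used in the study is not described here. The function name
is hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_bins, classes):
    """Plug-in estimate of I(x; y) in bits from paired discrete samples."""
    n = len(classes)
    joint = Counter(zip(feature_bins, classes))
    px = Counter(feature_bins)
    py = Counter(classes)
    # p(x, y) * log2( p(x, y) / (p(x) p(y)) ), written with raw counts.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


informative = mutual_information([0, 0, 1, 1], [1, 1, 2, 2])
uninformative = mutual_information([0, 1, 0, 1], [1, 1, 2, 2])
```

A feature that determines the class exactly in a balanced two-class sample
yields one bit, while a feature that is independent of the class yields zero,
so the informative feature would receive the larger weight.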



5 Optimising weights in the NN distance metric 

The cost function to be minimised is defined in equation (3):

$$E = \frac{N_{mis}}{N} + \alpha \sum_{i} w_i^2. \qquad (3)$$

The first term, $N_{mis}/N$, is the fraction of cases that are mis-classified.
The second term is designed to control the inflation of weight values, with
$\alpha$ as the regularisation coefficient.



5.1 Genetic algorithm 

Genetic algorithms (Holland 1992) are loosely based on Darwinian principles of
evolution: reproduction, genetic recombination and the survival-of-the-fittest.
In the present study, each individual in the total population (N) contains 15
floating-point numbers representing the set of 15 weights associated with the
15 features. The values of all weights in the population are initially
generated using a random number lying between 0 and 1. The values of the cost
function are calculated for each generation and ranked in ascending order. The
following three operations are used to create two new individuals at each
generation, who replace the two least optimal individuals.

Selection: At each iteration, two individuals in the population are selected as
parents to produce two offspring for the next generation. In the steady state
approach to generating offspring, the population is kept static with most
individuals being retained in the next generation. The selection of parent
individuals is random, with more optimal ones having a higher probability of
being selected.

Crossover: In the crossover operation, two new individuals are formed using
both parents selected in the selection process. $n_1$ weight values from one
individual (the father) and $n - n_1$ from the other individual (the mother)
are selected, where n is the number of weights and $n_1$ is chosen randomly.

Mutation: After two offspring are formed, the weight value for each feature has
a certain probability of being increased or decreased by a small amount. The
extent of this change is decided by a normally distributed random number.
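
A steady-state loop built from these three operations might look like the
sketch below. The rank-proportional selection weights, the population size, the
mutation probability p_mut and the mutation scale sigma are all assumed values
chosen for illustration, and steady_state_ga is a hypothetical name. The cost
function is passed in by the caller and at least two weights are assumed.

```python
import random


def steady_state_ga(cost, n_weights, pop_size=20, generations=200,
                    p_mut=0.1, sigma=0.05, seed=0):
    rng = random.Random(seed)
    # Initial weights are random numbers between 0 and 1.
    pop = [[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)                                # best individual first
        # Selection: better-ranked individuals are more likely to be parents.
        ranks = [pop_size - r for r in range(pop_size)]
        father, mother = rng.choices(pop, weights=ranks, k=2)
        # Crossover: n1 weights from one parent, the rest from the other.
        n1 = rng.randint(1, n_weights - 1)
        children = [father[:n1] + mother[n1:], mother[:n1] + father[n1:]]
        # Mutation: small normally distributed change with probability p_mut.
        for child in children:
            for i in range(n_weights):
                if rng.random() < p_mut:
                    child[i] += rng.gauss(0.0, sigma)
        pop[-2:] = children                               # replace the two worst
    return min(pop, key=cost)
```

Because only the two worst individuals are replaced in each generation, the
best individual found so far is never lost.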



5.2 Simulated Annealing (SA) 

Developed by Kirkpatrick et al (Kirkpatrick 1983; Kirkpatrick 1984; Haykin 
1994), SA is designed to avoid sub-optimal solutions that can result when a 
complex system with many parameters is trapped by local minima in parameter 
space. This is achieved by occasionally using solutions that result in an increase 
to the cost function. This helps the system find the global minimum by allowing 
it to jump out of local minima. 

In our study, the Metropolis algorithm is applied to simulate the cooling of 
a physical system in a heat bath as its temperature moves toward thermal equi- 
librium. At each step in this cooling process, the weight of a randomly selected 
feature is changed by a small random amount. The change is always accepted 
if this results in a lower cost function. Even when the cost function is increased 
by the change, the change is still accepted with a small probability given by
equation (4):

$$P(\Delta E) = \exp\left(-\frac{\Delta E}{T}\right), \qquad (4)$$

where $\Delta E$ is the change in the energy E of the system (i.e., the cost
function) and T is the temperature of the system. The probability of
acceptance is initially set
sufficiently high to ensure that nearly all changes are accepted. The probability 
of acceptance is gradually lowered as the system is cooled until the only changes 
accepted are those resulting in a lower cost function. 
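
The acceptance rule of equation (4) can be sketched as follows. The geometric
cooling schedule, the start and end temperatures, the step size and the number
of steps per temperature are assumptions made for illustration only. The names
acceptance_probability and anneal are hypothetical.

```python
import math
import random


def acceptance_probability(delta_e, t):
    """Metropolis rule: always accept improvements, else exp(-dE / T)."""
    return 1.0 if delta_e <= 0 else math.exp(-delta_e / t)


def anneal(cost, weights, t_start=1.0, t_end=1e-3, cooling=0.95,
           steps_per_t=50, step=0.05, seed=0):
    rng = random.Random(seed)
    current = list(weights)
    energy = cost(current)
    best, best_energy = list(current), energy
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            i = rng.randrange(len(current))       # randomly selected feature
            old = current[i]
            current[i] += rng.uniform(-step, step)
            new_energy = cost(current)
            if rng.random() < acceptance_probability(new_energy - energy, t):
                energy = new_energy
                if energy < best_energy:
                    best, best_energy = list(current), energy
            else:
                current[i] = old                  # reject: undo the change
        t *= cooling                              # assumed geometric cooling
    return best, best_energy
```

At a high temperature nearly every change is accepted, and as the temperature
falls the rule accepts almost nothing but improvements.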



6 Results 

Table 1 compares the agreement rates obtained using different weight
optimisation methods with results obtained using an unweighted NN algorithm
based on Euclidean distance and a multi-layer perceptron. The performance of
the unweighted NN algorithm was poor, with an agreement rate of just 51% on the
test set. This compares with the excellent agreement rate of 71% obtained using
a multi-layer perceptron. The performance of the NN algorithm improved markedly
to a 63% agreement rate when features are assigned weights equal to their
mutual information values. Performance is improved further (73% agreement
rate) by applying simulated annealing to these mutual information values.
Performance is improved to its highest level (77% agreement rate) when a
genetic algorithm is used to optimise weights set to random initial values.

Table 1. Agreement rate (%) obtained using various weight optimisation methods.

                          Training Set   Test Set
Euclidean Distance            53.13        51.13
Random+SA                     72.00        68.67
Mutual Info                   63.13        62.91
MutInfo+SA                    77.13        73.43
Genetic Algorithm             80.38        76.69
Multi-layer Perceptron        70.63        71.43

7 Conclusion 

The appropriate weighting for each feature in the NN algorithm is reasonably 
well approximated by the mutual information between each feature in a pa- 
tient’s profile and the patient’s classification. Performance is improved further 
by refining the mutual information values using either simulated annealing or a 
genetic algorithm. When optimised weights are used, the NN algorithm is able to 
achieve an agreement rate on a real-world dataset that is at least as good as that 
obtained using a multi-layer perceptron. As long as an appropriately weighted 
distance metric is used, this study demonstrates the feasibility of applying the 
NN algorithm to a large scale real world application.

Abstract. Handling missing attribute values is an important issue for
classifier learning, since missing attribute values in either training data 
or test (unseen) data affect the prediction accuracy of learned classi- 
fiers. In many real KDD applications, attributes with missing values are 
very common. This paper studies the robustness of four recently devel- 
oped committee learning techniques, including Boosting, Bagging, Sasc, 
and SascMB, relative to C4.5 for tolerating missing values in test data. 
Boosting is found to have a similar level of robustness to C4.5 for tolerat- 
ing missing values in test data in terms of average error in a representa- 
tive collection of natural domains under investigation. Bagging performs 
slightly better than Boosting, while Sasc and SascMB perform better 
than them in this regard, with SascMB performing best. 



1 Introduction 

One primary concern of classifier learning is the prediction accuracy. Recent re- 
search has shown that committee (or ensemble) learning techniques [1] can sig- 
nificantly increase the accuracy of base learning algorithms, especially decision 
tree learning [2,3,4]. Committee learning induces multiple individual classifiers 
to form a committee by repeated application of a single base learning algorithm. 
At the classification stage, the committee members vote to decide the final clas- 
sification. Decision tree learning [5] is one of the most well studied classifier 
learning techniques. It has been widely used in many KDD applications. 

On the other hand, handling missing attribute values is an important issue
for classifier learning, since missing attribute values in either training data
or test (unseen) data affect the prediction accuracy of learned classifiers.
Most currently available decision tree learning algorithms, including C4.5 [5],
can handle missing attribute values in training data and test data. The
committee learning techniques with these decision tree learning algorithms as
their base learners can also do this through the decision tree learning
algorithms. Quinlan [6] experimentally compares the effectiveness of several
approaches to dealing with missing values for single decision tree learning,
and shows that combining all possible outcomes of a test with a missing value
on the test example at the classification stage gives better overall prediction
accuracy than other approaches. C4.5 adopts this approach [5]. For other
approaches to dealing with missing attribute values at the tree generation
stage and classification stage, see [6].

To the best of the authors’ knowledge, no study has been carried out on 
the effect of missing attribute values on the accuracy performance of commit- 
tee learning techniques. In this paper, we study the robustness of four recently 
developed committee learning techniques including Boosting [4,7], Bagging [3], 
Sasc [8], and SascMB [9] relative to C4.5 for tolerating missing values in 
test data. The motivation for this research is as follows. These four committee 
learning techniques can dramatically reduce the error of C4.5. This observation 
was obtained from experiments in domains containing no or a small amount 
of missing attribute values. It is interesting to know how the accuracy of the 
committee learning methods changes relative to that of C4.5, as test sets con- 
tain more and more missing attribute values. In some real world applications, 
unseen (test) examples do contain many missing attribute values, while training 
set contains relatively complete attribute information. For example, in medical 
diagnosis, the training data from historical records of patients could contain al- 
most all attribute values. However, when a new patient presents, it is desirable 
to perform a preliminary diagnosis before the results of some time-consuming 
medical examinations become available if time is crucial, or before conducting 
some very expensive medical examinations if cost is important. In this situation, 
classification needs to be performed on unseen examples (new patients) with 
many missing attribute values. We expect this classification to be as accurate 
as possible, although we cannot expect that it reaches the accuracy level when 
these attribute values are not missing. 

In the following section, we briefly describe the ideas of the four committee 
learning techniques. Section 3 empirically explores the robustness of C4.5 and 
the four committee learning algorithms for tolerating missing attribute values in 
test data. The last section summarises conclusions and outlines some directions
for future research. The full version of this paper is available from [10], which 
also contains an approach to improving the robustness of the committee learning 
techniques for tolerating missing attribute values in test data. 

2 Boosting, Bagging, Sasc, and SascMB 

Bagging [3] generates different classifiers using different bootstrap samples.
Boosting [2,4,7] builds different classifiers sequentially. The weights of
training examples used for creating each classifier are modified based on the
performance of the previous classifiers. The objective is to make the
generation of the next classifier concentrate on the training examples that
are misclassified by the previous classifiers. Boost and Bag are our
implementations of Boosting and Bagging respectively. Sasc [8] builds different
classifiers by modifying the set of attributes considered at each node, while
the distribution of the training set is kept unchanged. Each attribute set is
selected stochastically. SascMB is a combination of Boosting, Bagging, and
Sasc. It generates N subcommittees. Each subcommittee contains S decision
trees built from a bootstrap sample of the training data using a procedure
that combines Boosting and Sasc by both stochastically selecting attribute
subsets and adaptively modifying the distribution of the training set. Brief
descriptions of Boost, Bag, and Sasc as well as a full description of SascMB
can be found in [9] in the same proceedings. Readers may refer to the
references mentioned above for further details about these algorithms.

3 Effects of Missing Values on Accuracy 

In this section, we use experiments to explore how the accuracy of the committee 
learning methods changes relative to that of C4.5, as test sets contain more 
and more missing attribute values. Thirty-seven natural domains from the UCI 
machine learning repository [11] are used.^ 

In every domain, two stratified 10-fold cross-validations were carried out
for each algorithm. Some of these 37 domains contain missing attribute values,
but some do not. To simulate the situation where unseen examples contain a
certain amount of missing attribute values, we randomly introduce missing
attribute values into test sets at a given level L. For each fold of the
cross-validations, each learning algorithm is applied to the training set. To
get a reliable test accuracy estimate of each learned classifier, it is
evaluated using ten different corrupted versions of the original test set of
the fold with the same level of missing attribute values, L, and the
evaluation results are averaged. Each of the corrupted versions is derived by
replacing, with probability L, each attribute value in the original test set
with the missing value “?”. In each domain, all the algorithms are run on the
same training and test set partitions with the same ten corrupted versions of
each test set. The result reported for each algorithm in each domain is an
average value over 20 trials. We investigate 10%, 20%, 30%, 40%, and 50% for L
in this study. Pruned trees are used for all the algorithms. Boost, Bag, Sasc,
and SascMB all use probabilistic predictions (without voting weights) for
voting to decide the final classification. The committee size is set at 100 in
the experiments for Boost, Bag, and Sasc. The subcommittee size and the number
of subcommittees are set at 5 and 20 respectively, resulting in 100 trees in
total for SascMB. The probability of each attribute being selected into the
subset is set at the default, 33%, for Sasc and SascMB.
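
The corruption step of this protocol can be sketched as follows, assuming that
a test set is held as a list of rows of attribute values (class labels kept
separately) and that L is given as a fraction between 0 and 1. The names
corrupt and corrupted_versions are hypothetical.

```python
import random

MISSING = "?"


def corrupt(test_set, level, rng):
    """Replace each attribute value by '?' independently with probability level."""
    return [[MISSING if rng.random() < level else value for value in row]
            for row in test_set]


def corrupted_versions(test_set, level, n_versions=10, seed=0):
    """Ten (by default) independently corrupted copies of one test set."""
    rng = random.Random(seed)
    return [corrupt(test_set, level, rng) for _ in range(n_versions)]
```

Generating the versions once from a fixed seed and reusing them for every
algorithm mirrors the requirement that all algorithms see the same corrupted
test sets.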

Figure 1 shows the average error rates of the five learning algorithms over the 
37 domains as a function of the missing attribute value level L. The detailed error 
rates of these algorithms and error ratios of each committee learning algorithm 
over C4.5 and more discussions can be found in [10]. From these experimental 
results, we have the following observations. 

(1). All four committee learning algorithms can significantly reduce the
average error of C4.5 at all missing attribute value levels from 0 to 50% across
the 37 domains. Among them, SascMB always has the lowest average error.
This is clearly shown in Figure 1.

^ In [9], we use 40 domains. To reduce the computational requirements of the
experiments for this study, we exclude the three largest domains from the test
suite. The partial results that we have obtained in these 3 domains are
consistent with our claims in this paper.

[Figure 1. Panel title: Average over the 37 domains. Curves: C4.5, Boost, Bag,
Sasc, SascMB. Horizontal axis: missing value level L (%) in test set.]

Fig. 1. Effects of missing values in test data on the error of the five learning
algorithms

(2). The average error of C4.5 in the 37 domains increases linearly with the
growth of the missing attribute value level.

(3). When L = 0, Boost performs better than Bag and Sasc in terms of
either lower average error rate or lower average error ratio to C4.5. However,
Sasc is superior, on average, to Boost when test sets contain 10% missing
attribute values or more, while Bag is superior, on average, to Boost when test
sets contain 30% missing attribute values or more. This indicates that Bag and
Sasc perform better than Boost for tolerating missing attribute values in test
data.

(4). The average error difference between Boost and C4.5 grows slightly
when L increases from 0 to 20%. It then starts to drop. When L reaches 50%, the
average error difference becomes smaller than the corresponding value for
L = 0. However, since the average error difference does not change much for
different L, Boost can be considered to be at a similar level of robustness to
C4.5 for tolerating missing values in test data.

(5). In terms of the robustness for tolerating missing attribute values in test
data, Bag performs better than Boost. The average error difference between
Bag and C4.5 grows (faster than that for Boost) when L increases from 0
to 30%. After that, the average difference decreases slightly, but the average
difference for L = 50% is higher than that for L = 0 and L = 10%.

(6). In the same regard, Sasc and SascMB perform better than Boost and
Bag, with SascMB performing best. The average error difference relative to
C4.5 for either Sasc or SascMB keeps growing when L changes from 0 to 40%.
At the end of the curves, that is at the point L = 50%, the difference reduces
only very slightly for either Sasc or SascMB.



4 Conclusions and Future Work 

In this paper, we have empirically explored the robustness of four recently
developed committee learning algorithms relative to C4.5 for tolerating missing
attribute values in test data using a representative collection of natural
domains. The four committee learning algorithms are Boosting, Bagging, Sasc,
and SascMB with C4.5 as their base learner. It has been found that Boosting is
at a similar level of robustness to C4.5 for tolerating missing values in test
data. Bagging performs slightly better than Boosting, while Sasc and SascMB
perform better than them, with SascMB performing best in this regard.

The stochastic attribute selection component may contribute to the greater 
robustness of Sasc and SascMB for tolerating missing attribute values in test 
data than the other algorithms, since it makes the former algorithms have more 
chances to generate trees with different attributes than the latter algorithms. 
However, this issue is worth further investigating. This paper only addresses the 
problem of tolerating missing attribute values in test data. Another interest- 
ing research topic is the robustness of these algorithms for tolerating missing 
attribute values in training data or in both training and test data.

Abstract. This paper presents a study on mixed similarity measures (MSM)
that allow classification and clustering to be done in many situations
without discretization. For supervised classification we carry out
experimental comparative studies of classifiers built by the decision tree
induction system C4.5 and by the k nearest neighbor rule using MSM. For
unsupervised clustering we first introduce an extension of the k-means
algorithm for mixed numeric and symbolic data, then evaluate the clusters
obtained by this algorithm against natural classes. Experimental studies allow
us to draw conclusions (meta-knowledge) that are significant in practice about
the mutual use of discretization techniques and MSM.



1 Introduction 

Similarity measures between objects play an essential role in many fields of 
computing such as pattern recognition (PR), machine learning (ML) or knowl- 
edge discovery in databases (KDD) - a rapidly growing interdisciplinary field 
that merges together techniques of databases, statistics, machine learning and 
others in order to extract useful knowledge from large databases. Initially, pat- 
tern recognition techniques have been developed mainly to deal with numerical 
datasets. However, most real-world databases in KDD contain heterogeneously 
numeric and symbolic attributes (mixed data) that require efficient techniques 
in these situations. 

This paper focuses on answering a question: are MSM beneficial for KDD and, if
so, when are they more appropriate than discretization techniques? We limit
ourselves to a study of the MSM developed in [5] from a similarity measure in
biological taxonomy [2].

2 A Mixed Similarity Measure 

In [5], the authors present a similarity measure for data with mixed numeric and
nominal attributes whose core is a similarity measure developed for biological
taxonomy [2]. The key idea of this MSM is that uncommon feature values make a
greater contribution to the overall similarity between two objects.

Two objects i and j are considered more similar than two other objects l and
m if i and j exhibit a greater match in attribute values that are more uncommon
in the population. For symbolic attributes, the relation between the
similarities $(S_{ij})_k$ and $(S_{lm})_k$ on attribute k can be expressed as
follows:

$$\left. \begin{array}{l} ((V_i)_k = (V_j)_k) \wedge ((V_l)_k = (V_m)_k) \\ ((p_i)_k = (p_j)_k) \geq ((p_l)_k = (p_m)_k) \end{array} \right\} \Longrightarrow (S_{ij})_k \leq (S_{lm})_k \qquad (1)$$

where $(p_i)_k$, $(p_j)_k$, $(p_l)_k$, and $(p_m)_k$ define the probabilities of
occurrence of the respective attribute values $(V_i)_k$, $(V_j)_k$, $(V_l)_k$,
and $(V_m)_k$ in the population.

When comparing similarity for pairs of numeric attribute values, the measure
takes into account both the magnitude of the attribute value difference and the
uniqueness of the attribute value pair. Given two pairs of objects (i, j) and
(l, m), a larger difference between the values gives a lower similarity:

\[
|(V_i)_k - (V_j)_k| > |(V_l)_k - (V_m)_k| \Rightarrow (S_{ij})_k < (S_{lm})_k \qquad (2)
\]

When the magnitudes of the differences for the two pairs of values are equal,
the uniqueness of the segment defined by the values weights the similarity index:

\[
|(V_i)_k - (V_j)_k| = |(V_l)_k - (V_m)_k| \;\wedge\;
\sum_{t=(V_i)_k}^{(V_j)_k} p_t \ge \sum_{t=(V_l)_k}^{(V_m)_k} p_t
\Rightarrow (S_{ij})_k \le (S_{lm})_k \qquad (3)
\]

The uniqueness of the segment is computed by summing the frequencies of
occurrence of all values encompassed by the pair of values, i.e., the two sums
of p_t in (3) for the object value pairs ((V_i)_k, (V_j)_k) and ((V_l)_k, (V_m)_k),
respectively.

Using the dissimilarity score of the pair, (D_ij)_k = 1 - (S_ij)_k, the
similarity test scores for numeric attributes are combined by Fisher's
transformation:

\[
(\chi_c)^2_{ij} = -2 \sum_{k=1}^{t_c} \ln\big((D_{ij})_k\big) \qquad (4)
\]

where t_c is the number of numeric attributes in the data. The similarity test
scores from nominal attributes are combined using Lancaster's mean value
transformation [4]:

\[
(\chi_d)^2_{ij} = 2 \sum_{k=1}^{t_d} \left( 1 -
\frac{(D_{ij})_k \ln (D_{ij})_k - (D_{ij})'_k \ln (D_{ij})'_k}{(D_{ij})_k - (D_{ij})'_k}
\right) \qquad (5)
\]

where t_d is the number of nominal attributes in the data, (D_ij)_k is the
dissimilarity score for the nominal attribute value pair ((V_i)_k, (V_j)_k), and
(D_ij)'_k is the next smaller dissimilarity score in the nominal set. The
significance value of this distribution can be looked up in standard tables or
approximated from the expression:

\[
D_{ij} = e^{-\chi^2_{ij}/2} \sum_{k=0}^{t_d + t_c - 1} \frac{(\chi^2_{ij}/2)^k}{k!} \qquad (6)
\]

where \chi^2_{ij} = (\chi_c)^2_{ij} + (\chi_d)^2_{ij}. The overall similarity score
representing the set of (t_c + t_d) independent similarity measures is
S_ij = 1 - D_ij. In this way the similarity measures obtained from individual
attributes are combined to give the overall similarity between pairs of objects.
For details and illustrations of this MSM, see [5].
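
To make the combination step concrete, the following is a minimal Python sketch
of equations (4) to (6), assuming the per-attribute dissimilarity scores (D_ij)_k
have already been computed as in [5]. The function names, and the convention
0 ln 0 = 0 used when the next smaller score is zero, are our own assumptions
rather than part of the original description.

```python
import math


def fisher_chi2(numeric_d):
    """Equation (4): combine the dissimilarity scores of numeric attributes.

    Every score is assumed to lie in (0, 1].
    """
    return -2.0 * sum(math.log(d) for d in numeric_d)


def _xlogx(x):
    # Assumed convention for the boundary case: 0 * ln(0) = 0.
    return x * math.log(x) if x > 0.0 else 0.0


def lancaster_chi2(nominal_pairs):
    """Equation (5): each pair is (D, D_prev), where D_prev is the next
    smaller dissimilarity score in the nominal set and D > D_prev >= 0."""
    return 2.0 * sum(1.0 - (_xlogx(d) - _xlogx(dp)) / (d - dp)
                     for d, dp in nominal_pairs)


def combined_dissimilarity(chi2, n_attributes):
    """Equation (6): significance value for chi2 over n_attributes scores."""
    half = chi2 / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(n_attributes))


def overall_similarity(numeric_d, nominal_pairs):
    """S_ij = 1 - D_ij for one pair of objects."""
    chi2 = fisher_chi2(numeric_d) + lancaster_chi2(nominal_pairs)
    return 1.0 - combined_dissimilarity(chi2, len(numeric_d) + len(nominal_pairs))
```

As a sanity check on the sketch, with a single numeric attribute the combined
value reduces to that attribute's own score, so overall_similarity([0.3], [])
is approximately 0.7.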

3 Classification with Mixed Similarity Measures

We compare the performance of two classifiers: one built by the decision tree
induction system C4.5 [7], which discretizes numeric data, and the other by the
k-nearest neighbor rule (k-NNR) [1]. From the results obtained we draw some
conclusions on the mutual use of these two techniques in practice. The experiments
are done as follows: (1) Construct a classifier by k-NNR using this MSM without
discretization. To classify an unknown instance, its k nearest neighbors are
determined by this MSM and the instance is assigned to the most common class
among these k nearest neighbors. We tried different values of k and fixed
k = 7, as it gives stable results in these experiments; (2) Construct a classifier
by the decision tree system C4.5 with discretization; and (3) Carry out
experimental comparative studies of these classifiers on databases from the UCI
repository that contain both symbolic and numeric attributes, using 10-fold
stratified cross validation. The predictive accuracy is reported with a 95%
confidence interval.
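
The decision rule in step (1) can be sketched as follows. This is only an
illustration: toy_similarity is a hypothetical stand-in (a simple average of
per-attribute agreement), not the MSM of Section 2, and all names are our own.
Any function returning a larger value for more similar instances, including an
implementation of the MSM, could be passed in its place.

```python
from collections import Counter


def toy_similarity(a, b, numeric_ranges):
    """Stand-in similarity (NOT the MSM): average per-attribute agreement.

    numeric_ranges maps the index of each numeric attribute to its value range;
    all other attributes are treated as symbolic.
    """
    total = 0.0
    for idx, (x, y) in enumerate(zip(a, b)):
        if idx in numeric_ranges:
            span = numeric_ranges[idx] or 1.0      # guard against a zero range
            total += 1.0 - min(abs(x - y) / span, 1.0)
        else:
            total += 1.0 if x == y else 0.0
    return total / len(a)


def knn_classify(query, instances, labels, similarity, k=7):
    """Assign query to the most common class among its k most similar instances."""
    ranked = sorted(range(len(instances)),
                    key=lambda i: similarity(query, instances[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Tiny illustrative dataset: one numeric and one symbolic attribute.
train = [(1.0, "a"), (1.2, "a"), (0.9, "b"), (5.0, "b"), (5.5, "b"), (4.8, "a")]
labels = ["low", "low", "low", "high", "high", "high"]
ranges = {0: 4.6}                                   # max - min of attribute 0


def sim(a, b):
    return toy_similarity(a, b, ranges)


print(knn_classify((1.1, "a"), train, labels, sim, k=3))
print(knn_classify((5.2, "b"), train, labels, sim, k=3))
```

The experiments fix k = 7; the small k used here only keeps the toy example
readable.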

From the UCI repository of databases we select 18 datasets that contain
mixed symbolic and numeric data. Table 1 presents the predictive accuracy
(%) of C4.5 and k-NNR with this MSM when databases contain more numeric
attributes than symbolic ones. Table 2 presents the results when databases contain
more symbolic attributes than numeric ones. These results allow us to formulate
some hypotheses (meta-knowledge) about the mutual use of this MSM and
discretization for a given classification task: (1) When a database contains more
numeric attributes than symbolic ones, it may be better to use the MSM than
discretization. The experimental results confirm the intuition that in this case
we avoid discretizing many numeric attributes: (1a) The predictive accuracy
of k-NNR with this MSM is often higher than that of C4.5 using discretization;
(1b) However, k-NNR requires much more time than C4.5 to calculate the MSM in
these cases; (2) Conversely, when a database contains more symbolic attributes
than numeric ones, it may be better to do discretization and use classifiers for
symbolic data: (2a) The predictive accuracy of k-NNR with this MSM is
often lower than that of C4.5 using discretization; (2b) However, the
computational time for this MSM is much lower than in the cases with many
numeric attributes.



Table 1. Accuracy of two classifiers (when more numeric than symbolic attributes)

Database         Instances  Attributes  % of Attributes  Class  C4.5         k-NNR
                                        (Num : Sym)
imports-85       205        25          62 : 38          7      33.3 ± 6.4   61.9 ± 6.4
machine          209        8           78 : 28          8      63.2 ± 6.7   66.8 ± 5.4
echocardiogram   132        12          69 : 31          2      55.4 ± 8.5   63.5 ± 7.0
ecoli            336        7           88 : 12                 76.1 ± 4.8   79.5 ± 4.0
heart            270        13          54 : 46          2      77.1 ± 5.0   82.1 ± 4.6
meta-data        528        21          91 : 9           2      90.6 ± 2.6   100
yeast            1484       9           88 : 12          10     37.6 ± 4.2   55.4 ± 2.5

Table 2. Accuracy of two classifiers (when more symbolic than numeric attributes)

Database         Instances  Attributes  % of Attributes  Class  C4.5         k-NNR
                                        (Num : Sym)
adult            1 1000     14          43 : 53          7      86.8 ± 2.3   74.7 ± 2.7
anneal                      38                           6      92.0 ± 1.7   70.3 ± 2.9
ann-thyroid      1 460      21          29 : 71          3      95.8 ± 1.7   94.6 ± 2.4
australian                  14          43 : 57          3      86.8 ± 2.5   63.9 ± 3.4
crx                         15          40 : 60          2      83.8 ± 2.7   61.0 ± 3.4
german           1000       20          35 : 65          2      69.2 ± 2.9   64.1 ± 3.0
hepatitis        155        19          32 : 68          2      57.3 ± 7.9   54.0 ± 7.3
horse-colic      368        28          32 : 68          2      68.9 ± 4.8   61.6 ± 4.5
solar-flare      1389       12          23 : 77          2      87.4 ± 1.7   72.6 ± 2.4
allbp                       29          24 : 76          3      97.4 ± 1.8   45.1 ± 10.3
zoo                         18          12 : 88          7      91.0 ± 1.8   45.1 ± 10.3



4 Clustering with Mixed Similarity Measures

The well-known k-means algorithm [6], [3] is very simple and powerful. The
k-means algorithm and its variants are efficient in clustering large datasets and
thus very suitable for knowledge discovery and data mining. Given a set X of n
objects and an integer k > 1, the k-means algorithm produces a partition of X
into k clusters. The algorithm can be briefly summarized in the following four
steps: (1) Arbitrarily select k centers of k clusters from X; (2) Generate a new
partition by assigning each object to its nearest cluster center; (3) Compute k
new cluster centers; and (4) Repeat steps 2 and 3 until the stopping condition is
satisfied.

However, the use of this algorithm is limited to numeric data, as it requires
computing new cluster centers and a cost function that are both based on the
similarity measure between instances. In [8], the author presented an extension
of the k-means algorithm for symbolic data. Using the MSM, we present here another
extension of the k-means algorithm for mixed symbolic and numeric data. The key
point in this extension is step 3: how to compute new cluster centers for mixed
data? We determine the center of each cluster as a vector containing
mixed symbolic and numeric components according to the original attributes
in the dataset. The components of this center are calculated as follows: (1) For
each numeric attribute, its value for the new center is the mean of the values of
the cluster's instances at this attribute; and (2) For each symbolic attribute, its
value for the new center is the value with the highest frequency among the values
of the cluster's instances at this attribute.
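
The four steps and the mixed center computation can be sketched in Python as
follows. This is a minimal illustration under stated assumptions:
toy_dissimilarity is a hypothetical stand-in for 1 - MSM, the initial centers
are simply the first k objects, and the loop stops when the partition no longer
changes. None of these names comes from the original implementation.

```python
from collections import Counter


def mixed_center(cluster, numeric_idx):
    """Step 3: mean for numeric attributes, most frequent value for symbolic ones."""
    center = []
    for a in range(len(cluster[0])):
        column = [obj[a] for obj in cluster]
        if a in numeric_idx:
            center.append(sum(column) / len(column))
        else:
            center.append(Counter(column).most_common(1)[0][0])
    return tuple(center)


def toy_dissimilarity(x, c, numeric_idx):
    """Hypothetical stand-in for 1 - MSM: absolute difference or 0/1 mismatch."""
    return sum(abs(u - v) if a in numeric_idx else float(u != v)
               for a, (u, v) in enumerate(zip(x, c)))


def kmeans_mixed(objects, k, numeric_idx, max_iter=20):
    centers = list(objects[:k])                     # step 1: first k objects
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: toy_dissimilarity(x, centers[c], numeric_idx))
            for x in objects
        ]
        if new_assignment == assignment:            # step 4: partition is stable
            break
        assignment = new_assignment
        # Step 3: recompute the center of every non-empty cluster.
        for c in range(k):
            members = [x for x, a in zip(objects, assignment) if a == c]
            if members:
                centers[c] = mixed_center(members, numeric_idx)
    return assignment, centers


objects = [(1.0, "red"), (9.0, "blue"), (1.5, "red"),
           (8.5, "blue"), (0.5, "green"), (9.5, "blue")]
assignment, centers = kmeans_mixed(objects, k=2, numeric_idx={0})
print(assignment, centers)
```

Because the center of a cluster is itself a mixed vector, the same measure that
compares two instances can also compare an instance with a center, which is what
step 2 relies on.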

We have implemented this extension of the k-means algorithm with MSM and
carried out experiments with several UCI datasets containing mixed data. To
investigate the performance of clustering by the k-means algorithm with MSM, the
experiments are done as follows: (1) Choose the value k as the number of natural
classes in the given dataset; (2) Choose k initial centers in one of two ways: (2a)
selecting arbitrarily the first k instances in the dataset; (2b) selecting k
representatives of clusters as k instances from the original classes in the dataset
(if possible); (3) For each generated class we determine its instances, then
compare them to those of the original class to find how they match each other;
(4) Run 20 iterations for each dataset and determine the best prediction accuracy
for (2a) and (2b). The experimental results are shown in Table 3.

Table 3. Prediction accuracy by the k-means algorithm with MSM for mixed datasets

Database      Instances  Attributes  % of Attributes  Class  Prediction accuracy (%)
                                     (Num : Sym)             Centers by (2a)  Centers by (2b)
allbp         517        29          24 : 76          3      68.20            59.49
ann-thyroid   460        21          29 : 71          3      75.30            82.20
australian    690        14          43 : 57          3      79.12            86.10
cleve         303        13          50 : 50          5      56.61            83.67
crx           690        15          40 : 60          2      76.44            79.92
hepatitis     155        19          32 : 68          2      53.40
solar-flare   1389       12          23 : 77          2      56.22            51.11
zoo           109        18          12 : 88          7      79.71            96.54
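
Step (3) of this procedure compares each generated cluster with the original
classes, but the exact matching rule is not spelled out. One common choice,
shown here purely as an assumption, is to label each cluster with its majority
class and report the percentage of instances that agree with that label. The
function name is hypothetical.

```python
from collections import Counter


def cluster_accuracy(assignment, true_labels):
    """Percentage of instances whose class equals the majority class of their
    cluster (an assumed scoring rule, not necessarily the one used above)."""
    by_cluster = {}
    for cluster_id, label in zip(assignment, true_labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return 100.0 * matched / len(true_labels)


# Six instances in two clusters; one instance of class "b" falls in cluster 0.
score = cluster_accuracy([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
print(round(score, 2))
```

Taking the best value over several runs, as in step (4), then amounts to
applying max to the scores of the individual runs.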

The following observations can be made from these experiments: (1) The
MSM is quite a good measure for clustering datasets with both numeric and
symbolic data. In some cases the prediction accuracy is very high, such as 96%
for the dataset 'zoo' (compared with the classes of the original data); (2) The
highest prediction accuracy of the partitions was often found within 10 iterations
of the k-means algorithm for arbitrary centers chosen by (2a), and within 5
iterations for centers chosen by (2b); (3) It is better to choose the initial
cluster centers as representatives of the original classes if some background
knowledge or class labels are available; (4) Most of the computation time is spent
calculating the MSM.

Our objectives for further work are to develop another MSM that preserves the
good properties of the MSM considered here but is much faster, and to design
efficient KDD algorithms using the new MSM.

Abstract. Association rules are a class of important regularities in databases. They
are found to be very useful in practical applications. However, the number of asso- 
ciation rules discovered in a database can be huge, thus making manual inspection 
and analysis of the rules difficult. In this paper, we propose a new framework to 
allow the user to explore the discovered rules to identify those interesting ones. 
This framework has two components, an interestingness analysis component, and a 
visualization component. The interestingness analysis component analyzes and or- 
ganizes the discovered rules according to various interestingness criteria with re- 
spect to the user’s existing knowledge. The visualization component enables the 
user to visually explore those potentially interesting rules. The key strength of the 
visualization component is that from a single screen, the user is able to obtain a 
global and yet detailed picture of various interesting aspects of the discovered 
rules. Enhanced with color effects, the user can easily and quickly focus his/her 
attention on the more interesting/useful rules. 



1. Introduction 

Association rules, introduced in [2], have received considerable attention in data 
mining research and applications. The main strengths of association rule mining are 
that the target of discovery is not pre-determined, and that it is able to find all associa- 
tion rules that exist in the database. Thus, association rules can reveal valuable and 
unexpected information in the database. However, these strengths are also its weak- 
ness, i.e., the number of discovered rules can be huge, in thousands or even tens of 
thousands, which makes manual inspection of the rules to identify the interesting ones 
an almost impossible task. Automated assistance is thus needed. 

Determining the interestingness of a rule is not a simple task. A rule can be inter- 
esting to one person but not interesting to another. The interestingness of a rule is 
essentially subjective. It depends on the user’s existing knowledge about the domain 
and his/her current interests. 

This paper proposes a new interactive and iterative framework to help the user find 
interesting association rules. The proposed framework consists of two components, an 
interestingness analysis component and a visualization component. The interestingness 
analysis component allows the user to specify his/her existing knowledge. It then uses 
this input knowledge to analyze the discovered rules according to various interesting- 
ness criteria, and through such analysis to identify those potentially interesting rules 
for the user. The visualization component makes it easy for the user to visually
explore the potentially interesting rules. The key strength of the visualization component
is that from a single screen, the user is able to obtain a global and yet detailed picture 
of various interesting aspects of the discovered rules. This enables him/her to visually 
detect any unusual pattern without the need to browse through a large number of rules. 
Three main types of information are shown on the screen: 

(1) different kinds of potentially interesting rules. 

(2) different degrees of rule interestingness and the number of rules in each kind. 

(3) interesting items in the conditional part or the consequent part of the rules. 
Enhanced with color effects, these types of information can lead the user to easily and 
quickly explore various aspects of the discovered rules and to focus his/her attention 
on those truly interesting/useful ones. The whole system works as follows: 

Repeat until the user decides to stop 

1 the user specifies some existing knowledge or modifies the knowledge specified pre- 
viously; 

2 the system analyzes the discovered rules according to some interestingness criteria; 

3 the user inspects the analysis results through the visualization component, saves the 
interesting rules, and removes those unwanted rules. 



2. Association Rules and Subjective Rule Interestingness 

2.1 Generalized association rules 



Let I = {i_1, ..., i_w} be a set of items, T be a set of transactions, and G be a set of
taxonomies or class hierarchies. A taxonomy is a directed acyclic graph on the items in
I, where an edge represents an is-a relationship. A taxonomy example is shown in Fig
1. A generalized association rule [15] is an implication of the form X → Y, where
X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X → Y holds in the transaction set T with
confidence c if c% of transactions in T that support X also support Y. The rule has
support s in T if s% of the transactions in T contain X ∪ Y. A transaction t that
supports an item in I also supports all its ancestors in I. For example, an
association rule could be:
    cheese, milk → Fruit [support = 5%, confidence = 70%],
which says that 5% of people buy cheese, milk and Fruit together, and 70% of the
people who buy cheese and milk also buy Fruit (of any kind).
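
As an illustration of these definitions, the sketch below computes support and
confidence over a handful of made-up transactions. It assumes, for simplicity,
that every item has at most one parent (a general taxonomy is a directed acyclic
graph and may have several); the PARENT table and function names are
hypothetical.

```python
# Parent links for the Fruit and Meat classes of the example taxonomy.
PARENT = {"grape": "Fruit", "pear": "Fruit", "apple": "Fruit",
          "beef": "Meat", "pork": "Meat", "chicken": "Meat"}


def with_ancestors(transaction, parent=PARENT):
    """A transaction that supports an item also supports all its ancestors."""
    expanded = set()
    for item in transaction:
        while item is not None:
            expanded.add(item)
            item = parent.get(item)
    return expanded


def support_confidence(transactions, x, y):
    """Return (support %, confidence %) of the rule x -> y."""
    expanded = [with_ancestors(t) for t in transactions]
    n_x = sum(1 for t in expanded if x <= t)
    n_xy = sum(1 for t in expanded if (x | y) <= t)
    support = 100.0 * n_xy / len(expanded)
    confidence = 100.0 * n_xy / n_x if n_x else 0.0
    return support, confidence


transactions = [{"cheese", "milk", "apple"}, {"cheese", "milk", "pear"},
                {"cheese", "milk"}, {"beef", "grape"}, {"milk"}]
print(support_confidence(transactions, {"cheese", "milk"}, {"Fruit"}))
```

In this toy data two of the five transactions contain cheese, milk and some
fruit, and three contain cheese and milk, so the rule has support 40% and
confidence of about 66.7%.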




    Fruit: grape, pear, apple

    Meat: beef, pork, chicken

Fig 1. An example taxonomy



2.2 Subjective rule interestingness 

Past research has identified two main subjective rule interestingness measures: 
Unexpectedness [14, 7]: Rules are interesting if they “surprise” the user. 

Actionability [11]: Rules are interesting if the user can do something with them to 
his/her advantage. 

The two measures of interestingness are not mutually exclusive. Interesting rules 
can be classified into three categories [14]: 

1: rules that are both unexpected and actionable,

2: rules that are unexpected but not actionable, and 

3: rules that are actionable but expected. 

Categories 1 and 2 can be handled by finding unexpected rules, and category 3 can be
handled by finding the rules that conform to the user's knowledge. This paper
proposes a new framework to help the user find these two types of rules, i.e.,
unexpected rules and expected rules (or conforming rules).

3. The Interestingness Analysis Component 

This component uses the user’s existing knowledge to analyze and identify various 
types of potentially interesting rules from the discovered association rules. 

3.1. The specification language 

A specification language is designed to enable the user to express his/her existing 
knowledge. This language focuses on representing the user’s existing knowledge 
about associative relations on items in the database. The basic syntax of the language 
takes the same format as association rules. 

This language has three levels of specifications. Each represents knowledge of a 
different degree of preciseness. They are: general impressions, reasonably precise 
concepts, and precise knowledge. The first two levels represent the user’s vague 
feelings. The last level represents his/her precise knowledge. This division is impor- 
tant because a user typically has a mixture of vague and precise knowledge. 

The proposed language also uses the idea of class hierarchy (or taxonomy) as in 
generalized association rules. The hierarchy in Fig 1 can also be represented by: 
    {grape, pear, apple} ⊂ Fruit ⊂ Fooditem
    {milk, cheese, butter} ⊂ Dairy_product ⊂ Fooditem
    {beef, pork, chicken} ⊂ Meat ⊂ Fooditem

Fruit, Dairy_product, Meat and Fooditem are classes (or class names). grape, pear,
apple, milk, cheese, beef, pork, chicken, #Fruit, #Dairy_product, #Meat and
#Fooditem are items. Note that in generalized association rules, class names can also
be treated as items, in which case we prepend a "#" to the class name. Note also
that in the proposed language, a class hierarchy does not need to be constructed
beforehand. A class can be created at any time when needed by using a set of items
(see the examples below).

General Impression (GI): It represents the user's vague feeling that there should be
some associations among some classes of items, but he/she is not sure how they
are associated. This can be expressed with:

    gi(<S_1, ..., S_m>) [support, confidence]

where (1) Each S_i is one of the following: an item, a class, or an expression C+
or C*, where C is a class. C+ and C* correspond to one or more, and
zero or more instances of the class C, respectively.

(2) A discovered rule a_1, ..., a_n → b_1, ..., b_k conforms to the GI if
<a_1, ..., a_n, b_1, ..., b_k> can be considered to be an instance of
<S_1, ..., S_m>; otherwise it is unexpected with respect to the GI.

(3) Support and confidence are optional. The user can specify the minimum
support and the minimum confidence of the rules that he/she
wants to see.

Example 1: The user believes that there exist some associations among {milk,
cheese}, Fruit items, and beef (assume we use the class hierarchy in Fig 1). He/she
specifies this as:

    gi(<{milk, cheese}*, Fruit+, beef>)

{milk, cheese} here represents a class constructed on the fly, unlike Fruit. The
following are examples of association rules that conform to the specification:

    apple → beef
    grape, pear, beef → milk

The following two rules are unexpected with respect to this specification:

    (1) milk → beef        (2) milk, cheese, pear → clothes

(1) is unexpected because Fruit+ is not satisfied. (2) is unexpected because beef is
not present in the rule, and clothes is not from any of the elements in the GI.

Reasonably Precise Concept (RPC): It represents the user's concept that there
should be some associations among some classes of items, and he/she also knows
the direction of the associations. This can be expressed with:

    rpc(<S_1, ..., S_m → V_1, ..., V_g>) [support, confidence]

where (1) Each S_i or V_j is the same as S_i in the GI specification.

(2) A discovered rule a_1, ..., a_n → b_1, ..., b_k conforms to the RPC if the
rule can be considered to be an instance of the RPC; otherwise it is
considered as unexpected with respect to the RPC.

(3) Support and confidence are again optional.

Example 2: Suppose the user believes the following:

    rpc(<Meat, Meat, #Dairy_product → {grape, apple}+>)

The following are examples of association rules that conform to the specification:

    beef, pork, Dairy_product → grape
    beef, chicken, Dairy_product → grape, apple

The following two rules are unexpected with respect to the specification:

    (1) pork, Dairy_product → grape        (2) beef, pork → grape

(1) is unexpected because it has only one Meat item, but two Meat items are
needed as we have two Meat's in the specification. (2) is unexpected because
Dairy_product is not in the conditional part of the rule.
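
The conforming and unexpected cases of Examples 1 and 2 can be reproduced with a
simple yes/no check, sketched below. This is a deliberate simplification: the
actual analysis assigns degrees of match rather than a Boolean answer. The
sketch assumes that the elements of one specification do not overlap, that a
plain class element (such as each Meat) requires exactly one item of that class,
and that a class used as an item is written with its "#" prefix throughout. All
names, including the labels given to the on-the-fly classes, are hypothetical.

```python
from collections import Counter

CLASSES = {
    "Fruit": {"grape", "pear", "apple"},
    "Meat": {"beef", "pork", "chicken"},
    "milk_or_cheese": {"milk", "cheese"},      # class built on the fly
    "grape_or_apple": {"grape", "apple"},      # class built on the fly
}


def side_conforms(items, elements, classes=CLASSES):
    """Check a list of items against elements given as (name, quantifier),
    where quantifier is "" (exactly one), "+" (one or more) or "*" (zero or more)."""
    low, unbounded = Counter(), set()
    for name, quantifier in elements:
        if quantifier in ("", "+"):
            low[name] += 1
        if quantifier in ("+", "*"):
            unbounded.add(name)
    names = {name for name, _ in elements}
    seen = Counter()
    for item in items:
        # A name that is not a class stands for the single item of that name.
        owners = [n for n in names if item in classes.get(n, {n})]
        if not owners:
            return False                        # item matches no element
        seen[owners[0]] += 1
    for name in names:
        if seen[name] < low[name]:
            return False                        # too few items for this element
        if name not in unbounded and seen[name] != low[name]:
            return False                        # too many items for plain elements
    return True


def conforms_gi(rule, elements):
    condition, consequent = rule
    return side_conforms(list(condition) + list(consequent), elements)


def conforms_rpc(rule, cond_elements, cons_elements):
    condition, consequent = rule
    return (side_conforms(condition, cond_elements)
            and side_conforms(consequent, cons_elements))


GI = [("milk_or_cheese", "*"), ("Fruit", "+"), ("beef", "")]
RPC_COND = [("Meat", ""), ("Meat", ""), ("#Dairy_product", "")]
RPC_CONS = [("grape_or_apple", "+")]

print(conforms_gi((["apple"], ["beef"]), GI))
print(conforms_rpc((["pork", "#Dairy_product"], ["grape"]), RPC_COND, RPC_CONS))
```

The only difference between the two checks is that the GI version ignores which
side of the rule an item appears on, whereas the RPC version checks the
conditional and consequent parts separately.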

Precise knowledge (PK): The user believes in a precise association. This is
expressed with:

    pk(<S_1, ..., S_m → V_1, ..., V_g>) [support, confidence]

where (1) Each S_i or V_j is an item in I.

(2) A discovered rule a_1, ..., a_n → b_1, ..., b_k [sup, confid] is equal to the
PK if the rule part is the same as S_1, ..., S_m → V_1, ..., V_g. Whether it
conforms to the PK or is unexpected depends on the support and
confidence specifications.

(3) Support and confidence need to be specified (they are not optional).

Example 3: Suppose the user believes the following:

    pk(<#Meat, milk → apple>) [10%, 50%]

The discovered rule below conforms to the PK quite well because the supports and
confidences of the rule and the PK are quite close.

    Meat, milk → apple [8%, 53%]

However, if the discovered rule is the following:

    Meat, milk → apple [4%, 30%]

then it is less conforming, but more unexpected, because its support and
confidence are quite far from those of the PK.

3.2. Analyzing the discovered rules using user’s existing knowledge 

We now present how to use the user’s specifications to analyze the discovered rules. 
For GIs and RPCs, we only perform syntax-based analysis, i.e., comparing the syn- 
tactic structure of the discovered rules with GIs and RPCs. It does not make sense to 
do semantics-based analysis because the user does not have precise associations in
mind. Using PKs, we can perform semantics-based analysis (based on support and
confidence) on the discovered rules. Due to space limitations, we do not present
this analysis here (see [9] for details).

Let U be the set of the user's specifications representing his/her knowledge space.
Let A be the set of discovered association rules. The proposed technique analyzes the
discovered rules by "matching" and ranking the rules in A in a number of ways to
find different kinds of interesting rules: conforming rules, unexpected consequent
rules, unexpected condition rules and both-side unexpected rules.

Conforming rules: A discovered rule A_i ∈ A conforms to a piece of the user's
knowledge U_j ∈ U if both the conditional and consequent parts of A_i match U_j
well. We use confm_ij to denote the degree of conforming match.

Purpose: conforming rules show us those discovered rules that conform to or are
consistent with our existing knowledge fully or partially.

Unexpected consequent rules: A discovered rule A_i ∈ A has unexpected consequents
with respect to a U_j ∈ U if the conditional part of A_i matches U_j well, but not
the consequent part. We use unexpConseq_ij to denote the degree of unexpected
consequent match.

Purpose: unexpected consequent rules show us those discovered rules that may be
contrary to our existing knowledge. These rules are often very interesting.

Unexpected condition rules: A discovered rule A_i ∈ A has unexpected conditions
with respect to a U_j ∈ U if the consequent part of A_i matches U_j well, but not
the conditional part. We use unexpCond_ij to denote the degree of unexpected
condition match.

Purpose: unexpected condition rules show us that there are other conditions that
can lead to the consequent of the specification. We are thus guided to explore
unfamiliar territories.

Both-side unexpected rules: A discovered rule A_i ∈ A is both-side unexpected with
respect to a U_j ∈ U if both the conditional and consequent parts of the rule A_i
do not match U_j well. We use bsUnexp_ij to denote the degree of both-side
unexpected match.

Purpose: both-side unexpected rules remind us that there are other rules whose
conditions and consequents are not mentioned in our specification. They help us
to go beyond our existing concept space.

The values for confm_ij, unexpConseq_ij, unexpCond_ij, and bsUnexp_ij are between 0
and 1.00. 1.00 represents a complete match, either complete conformity or
complete unexpectedness, and 0 represents no match. Let L_ij and R_ij be the
degrees of condition and consequent match of rule A_i against U_j, respectively.
We have (for both GIs and RPCs),

\[ \mathit{confm}_{ij} = L_{ij} \cdot R_{ij} \]

\[ \mathit{unexpConseq}_{ij} =
\begin{cases} 0 & L_{ij} - R_{ij} \le 0 \\ L_{ij} - R_{ij} & L_{ij} - R_{ij} > 0 \end{cases} \]

\[ \mathit{unexpCond}_{ij} =
\begin{cases} 0 & R_{ij} - L_{ij} \le 0 \\ R_{ij} - L_{ij} & R_{ij} - L_{ij} > 0 \end{cases} \]

\[ \mathit{bsUnexp}_{ij} = 1 - \max(\mathit{confm}_{ij}, \mathit{unexpConseq}_{ij}, \mathit{unexpCond}_{ij}) \]

We use L_ij - R_ij to compute the unexpected consequent match degree because we wish
to rank those rules with high L_ij but low R_ij higher. A similar idea applies to
unexpCond_ij. The formula for bsUnexp_ij ensures that rules with high values in any
of the other three categories have low values here, and vice versa.
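
The four match degrees translate directly into code once L_ij and R_ij are
known. The sketch below takes the two degrees as plain numbers in [0, 1]; how
they are obtained is not covered here, and the function name and dictionary keys
are our own choices.

```python
def match_degrees(l, r):
    """Return the four match degrees for condition match l and consequent match r."""
    confm = l * r
    unexp_conseq = max(l - r, 0.0)      # positive only when l exceeds r
    unexp_cond = max(r - l, 0.0)        # positive only when r exceeds l
    bs_unexp = 1.0 - max(confm, unexp_conseq, unexp_cond)
    return {"confm": confm, "unexpConseq": unexp_conseq,
            "unexpCond": unexp_cond, "bsUnexp": bs_unexp}


for l, r in [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.5, 1.0), (0.0, 0.0)]:
    print(l, r, match_degrees(l, r))
```

For instance, a rule whose condition matches half-way (L = 0.5) and whose
consequent matches fully (R = 1.0) gets confm = 0.5, unexpCond = 0.5,
unexpConseq = 0 and bsUnexp = 0.5, so it appears in both the conforming and the
unexpected condition rankings with the same value.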

Due to space limitations, we are unable to give the detailed computation methods
for L_ij and R_ij, which depend on whether U_j is a GI or a RPC. The computations
can all be done efficiently. See [9] for full details. After confm_ij,
unexpConseq_ij, unexpCond_ij, and bsUnexp_ij have been computed, we can rank the
discovered rules with respect to a U_j. It is also possible to rank the rules with
respect to the whole set of specifications U. However, in our applications, we find
that such rankings can be quite confusing, and they are thus omitted.

4. The Visualization Component 

After the discovered rules are analyzed with the method presented in the last section, 
we want to display those different types of potentially interesting rules to the user. 
The issue here is how to show the essential aspects of the rules such that we can take 
advantage of the human visual capabilities to allow the user to identify the truly inter- 
esting rules easily and quickly. Let us discuss what the essential aspects are: 

1. Types of potentially interesting rules: We should separate them because different 
types of interesting rules give the user different information. 

2. Degrees of interestingness (“match” values): We should group rules according to 
their degrees of interestingness. This enables the user to focus his/her attention on 
the most unexpected (or conforming) rules first and to decide whether to view 
those rules with low degrees of interestingness. 

3. Interesting items: We focus on showing the interesting items rather than the rules.
This is perhaps the most crucial decision that we have made. In our applications,
we find that it is the unexpected items that are most important to the user
because, due to 1 above, the user already knows what kind of interesting rules he/she
is looking for. For example, when the user is looking at unexpected consequent rules,
it is natural that the first thing he/she wants to know is what the unexpected
items in the consequent parts are. Even if we show the rules, the user still needs to
look for the unexpected items in the rules.

The main screen in the visualization system contains all the above information. Be- 
low, we use an example to describe the visualization system. 

The visualization system consists of 4 main modules: 

1. Class hierarchy builder: it allows the user to build class hierarchies as in Fig 1.

2. GI viewer: it allows the user to specify GIs and to visualize the results produced 
by the interestingness analysis system. 

3. RPC viewer: it allows the user to specify RPCs and to visualize the results pro- 
duced by the interestingness analysis system. 

4. PK viewer: it allows the user to specify PKs and to visualize the results produced 
by the interestingness analysis system. 

Here, we only focus on presenting the RPC viewer. Due to space limitations, we are 
unable to show the others. They are similar in concept to the RPC viewer. We will 
also not discuss the Class hierarchy builder since it is straightforward. 

4.1. The example setting 

Our example uses a RPC specification. The rules in the example are a small subset of 
rules (857 rules) discovered in an exam results database. This application tries to 
discover the associations between the exam results of a set of 7 specialized courses 
(called GA courses) and the exam results of a set of 7 basic courses (called GB 
courses). A course together with an exam result form an item, e.g., GA6-1, where 
GA6 is the course code and “1” represents a bad exam grade (“2” represents an aver- 
age grade and “3” a good grade). The discovered rules and our existing concept 
specification are listed below. 
• Discovered association rules: The rules below have only GA course grades on the
left-hand side and GB course grades on the right-hand side (we omit their support
and confidence).

    R1:  GA1-3 → GB2-3            R7:  GA4-1 → GB7-2
    R2:  GA4-3 → GB4-3            R8:  GA6-2 → GB7-2
    R3:  GA2-3 → GB2-3            R9:  GA5-1, GA2-2 → GB2-2
    R4:  GA2-3 → GB5-1            R10: GA5-2, GA1-2 → GB3-2
    R5:  GA6-1 → GB1-3            R11: GA6-1, GA3-3 → GB6-3
    R6:  GA4-2 → GB3-3            R12: GA7-2, GA3-3 → GB4-3



• Our existing concept specification

Assume we have the common belief that students good in GA courses are likely to
be good in GB courses. This can be expressed as a RPC (see also Fig 2):

    Spec1: rpc(GA-good → GB-good)

where the classes GA-good and GB-good are defined as follows:

    GA-good ⊃ {GA1-3, GA2-3, GA3-3, GA4-3, GA5-3, GA6-3, GA7-3}
    GB-good ⊃ {GB1-3, GB2-3, GB3-3, GB4-3, GB5-3, GB6-3, GB7-3}
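
As a rough cross-check of this example, the sketch below scores the twelve rules
against Spec1 using the match formulas of Section 3.2. It rests on one
assumption that is ours alone: the condition and consequent match degrees are
taken to be the fraction of items on each side that belong to GA-good and
GB-good, respectively. The actual computation of these degrees is given in [9]
and may differ.

```python
GA_GOOD = {f"GA{n}-3" for n in range(1, 8)}
GB_GOOD = {f"GB{n}-3" for n in range(1, 8)}

RULES = {
    "R1": (["GA1-3"], ["GB2-3"]),           "R2": (["GA4-3"], ["GB4-3"]),
    "R3": (["GA2-3"], ["GB2-3"]),           "R4": (["GA2-3"], ["GB5-1"]),
    "R5": (["GA6-1"], ["GB1-3"]),           "R6": (["GA4-2"], ["GB3-3"]),
    "R7": (["GA4-1"], ["GB7-2"]),           "R8": (["GA6-2"], ["GB7-2"]),
    "R9": (["GA5-1", "GA2-2"], ["GB2-2"]),  "R10": (["GA5-2", "GA1-2"], ["GB3-2"]),
    "R11": (["GA6-1", "GA3-3"], ["GB6-3"]), "R12": (["GA7-2", "GA3-3"], ["GB4-3"]),
}


def fraction_in(items, cls):
    """Assumed match degree: share of the items that belong to the class."""
    return sum(item in cls for item in items) / len(items)


def degrees(rule):
    condition, consequent = rule
    l, r = fraction_in(condition, GA_GOOD), fraction_in(consequent, GB_GOOD)
    confm = l * r
    conseq = max(l - r, 0.0)
    cond = max(r - l, 0.0)
    return {"confm": confm, "unexpConseq": conseq, "unexpCond": cond,
            "bsUnexp": 1.0 - max(confm, conseq, cond)}


def ranking(measure):
    """Rules with a non-zero value of the given measure, highest value first."""
    scored = [(degrees(rule)[measure], name) for name, rule in RULES.items()]
    return sorted(((v, n) for v, n in scored if v > 0), key=lambda p: -p[0])


for measure in ("confm", "unexpCond", "unexpConseq", "bsUnexp"):
    print(measure, ranking(measure))
```

Under this assumption, R11 and R12 each receive 0.5 for both confm and
unexpCond, because only one of their two condition items (GA3-3) belongs to
GA-good.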

4.2. Viewing the results 

After running the system with the above RPC specification, we obtain the screen in 
Fig 2 (the main screen). We see “RPC” in the middle. To the bottom of “RPC”, we 
have the conforming rules visualization unit. To the left of “RPC”, we have the unex- 
pected condition rules visualization unit. To the right, we have the unexpected conse- 
quent rules visualization unit. To the top, we have both-side unexpected rules visuali- 
zation unit. Below, we briefly discuss these units in turn with the example. 
Conforming rules visualization unit. Clicking on Conform, we will see the complete 
conforming rules ranking in a pop-up window: 



    Rank 1:  1.00  R1   GA1-3 → GB2-3
    Rank 1:  1.00  R2   GA4-3 → GB4-3
    Rank 1:  1.00  R3   GA2-3 → GB2-3
    Rank 2:  0.50  R11  GA6-1, GA3-3 → GB6-3
    Rank 2:  0.50  R12  GA7-2, GA3-3 → GB4-3




The number (e.g., 1.00 and 0.50) after each rank number is the conforming match
value, confm_i1. The first three rules conform to our belief completely. The last two
only conform to our belief partially because GA6-1 and GA7-2 are unexpected.
This list of rules can be long in a real-life application. The following mechanisms
help the user focus his/her attention, i.e., enabling him/her to view rules with
different degrees of interestingness ("match" values) and to view the interesting
items.

• On both sides of Conform we can see 4 pairs of boxes, which represent sets of
rules with different conforming match values. If a pair of boxes is colored, it
means that there are rules there; otherwise there is no rule. The line connecting
"RPC" and a pair of colored boxes also indicates that there are rules under
them. The number of rules is shown on the line. Clicking on the box with a
value will give all the rules with the corresponding match value and above. For
example, clicking on 0.50 shows the rules with 0.50 ≤ confm_i1 < 0.75. Below
each colored box with a value, we have two small windows. The one on the top
has all the rules' condition items from our RPC specification, and the one at the
bottom has all the consequent items. Clicking on each item gives us the rules
that use this item as a condition item (or a consequent item).

• Clicking on the colored box without a value (below the valued box) brings us 
to a new screen (not shown here). From this, the user sees all the items in dif- 
ferent classes involved, and also conforming and unexpected items. 

Fig 2. RPC main visualization screen

Unexpected condition rules visualization unit: The boxes here have similar meanings 
as the ones for conforming rules. From Fig 2, we see that there are 4 unexpected 
condition rules. Two have the unexpected match value of 1.00 and two have 0.50. 
The window (on the far left) connected to the box with a match value gives all the 
unexpected condition items. Clicking on each item reveals the relevant rules. 
Similarly, clicking on the colored box next to the one with a value shows both the 
unexpected condition items and the items used in the consequent part of the rules. 
To obtain all the rules in the category, we can click Unexpected Conditions. 



Rank 1:   1.00   R5    GA6-1 -> GB1-3
Rank 1:   1.00   R6    GA4-2 -> GB3-3
Rank 2:   0.50   R11   GA6-1, GA3-3 -> GB6-3
Rank 2:   0.50   R12   GA7-2, GA3-3 -> GB4-3



1.00 and 0.50 are the unexpCondu values. Here, we see something quite unexpected.
For example, many students with bad grades in GA6 actually have good
grades in GB1.

Unexpected consequent rules visualization unit: This is also similar to the conforming 
rules visualization unit. From Fig 2, we see that there is only one unexpected con- 
sequent rule and the unexpected consequent match value is 1.00. Clicking on the 
colored box with 1.00, we will obtain the unexpected consequent rule:

Rank 1:   1.00   R4    GA2-3 -> GB5-1

This rule is very interesting because it contradicts our belief. Many students with 
good grades in GA2 actually have bad grades in GB5. 

Both-side unexpected rules visualization unit: We only have two unexpected match
value boxes here, i.e., 1.00 and 0.50. Due to the formulas in Section 3.2, rules
with bsUnexp.j < 1.00 can actually all be seen from other visualization units. The
unexpected items can be obtained by clicking on the box above the one with a
value. All the ranked rules can be obtained by clicking Both Sides Unexpected.



Rank 1:   1.00   R7    GA4-1 -> GB7-2
Rank 1:   1.00   R8    GA6-2 -> GB7-2
Rank 1:   1.00   R9    GA5-1, GA2-2 -> GB2-2
Rank 1:   1.00   R10   GA5-2, GA1-2 -> GB3-2
Rank 2:   0.50   R11   GA6-1, GA3-3 -> GB6-3
Rank 2:   0.50   R12   GA7-2, GA3-3 -> GB4-3



From this ranking, we also see something quite interesting, i.e., average grades 
lead to average grades and bad grades lead to average grades. Some of these rules 
are common sense, e.g., average to average rules (R8 and R10), but we did not
specify them as our existing knowledge (if “average to average” had been specified 
as our knowledge earlier, these rules would not have appeared here because they 
would have been removed). This shows the advantage of our technique, i.e., it can 
remind us what we have forgotten if the rules are not truly unexpected. 



The system also allows the user to incrementally save interesting rules and remove
unwanted rules, and to view them. Whenever a rule is removed or saved (also removed
from the original set of rules), the related pictures and windows are updated.

The proposed system has proven to be very useful in a number of applications. In 
these applications, there are typically thousands of discovered association rules (the 
smallest rule set has 770 rules). Without the proposed system, it would be very hard 
for us to analyze these large numbers of rules. 



5. Related Work 

Traditionally, a query-based approach is used to help the user identify or generate 
interesting rules. The approach takes many forms, e.g., templates [6], M-SQL [5], 
DMQL [4], and action hierarchy [1]. Although query languages can be quite differ- 
ent, a query basically defines a set of rules of a certain type (or constraints on the 
rules to be found). To “execute” a query means to find all rules that satisfy the query. 
We believe that the query-based approach is insufficient for two main reasons: 

1. It is hard to find the truly unexpected rules. The approach finds only anticipated
rules because the user’s queries remain within his/her existing knowledge space.

2. The user often does not know, or is unable to specify completely, what interests
him/her. He/she needs to be stimulated. The query-based approach does not actively
perform this task because it only returns those rules that satisfy the queries.

Our technique not only finds those conforming rules like query-based methods, but 
also provides three types of unexpected rules. Our approach also helps the user to 
provide more knowledge to the system by reminding him/her what he/she might have 
forgotten. If the top ranking rules are not unexpected, then they serve to remind the 
user what he/she has forgotten. Our visualization component allows the user to easily 
and quickly explore those interesting rules. 

[7, 8] report a related technique for analyzing classification rules [13] using user’s 
existing concepts. However, the technique there cannot be used for analyzing asso- 
ciation rules. Association rules require a different specification language and different 
ways of analyzing and ranking the rules. [7, 8] also do not have visualization systems. 
[12] proposes a method of discovering unexpected rules in the rule generation 
phase by taking into consideration the user’s expectations. This method is, however,
not as efficient and flexible as our post-analysis method because the user is normally 
unable to specify his/her expectations about the domain completely. User interaction 
with the system is needed in order for him/her to provide a more complete set of 
expectations and to find more interesting rules. However, user interaction is difficult 
for the approach in [12] because it is not efficient to run a rule miner whenever the 
user remembers another piece of knowledge. Association rule mining is typically
very time consuming. Post-analysis facilitates user interaction due to its efficiency. 

6. Conclusion 

This paper proposes an integrated framework for exploration of discovered rules in 
order to find those interesting ones. The interestingness analysis system uses three 
types of user’s existing knowledge to analyze the discovered rules and to organize 
them in various ways to expose the user to many interesting aspects of the discovered 
rules. A simple but powerful visualization system enables the user to view and identify 
interesting rules easily and quickly. 

Abstract. We introduce an interactive system which visualizes the knowledge
in data mining processes, including attribute values, evolutionary
attributes, associations of attributes, classifications and hierarchical con- 
cepts. The basic framework of knowledge visualization in data mining is 
discussed and the algorithms for visualizing different forms of knowledge 
are presented. The application of our initial prototype system, DVIZ, to 
Canada Education Statistics is described and some preliminary results 
presented. 

1 Introduction

Visualizing data mining should prove useful to various applications, e.g., stock 
market trends, oil streak patterns and smoke in fluid dynamics, multiple needle 
strip charts in seismology, factor analysis, etc. Data mining is popularly recog- 
nized as an important and difficult stage of knowledge discovery in databases [3], 
which is defined as the extraction of interesting knowledge (rules, regularities, 
patterns, constraints) from data in large databases. It is a process of search- 
ing and analyzing data in order to find implicit but potentially useful infor- 
mation, knowledge, regularities, overviews and even to automatically construct 
knowledge-bases [6].

Generally speaking, the knowledge discovered in databases is represented 
as various rules, e.g., generalization rules, characterization rules, classification 
rules, association rules, evolution rules, etc. These rules are always static. In 
most cases, users are impatient to understand and use the results of data
mining as quickly as possible, and they prefer to visualize the process and results 
rather than analyzing an abstract logical form. 

Recent research on data visualization techniques encompasses the effective 
portrayal of data with the additional goal of providing insights about the data [2].
Data visualization is a process of transforming the generated abstract data into 
a meaningful visual form so that users can understand the data more easily. 
Effective visualization makes a data mining system’s nature apparent at a glance. 

We briefly introduce visualization techniques and their applications to data 
mining. The visualization techniques employed in this paper include: 

N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 390-399, 1999. 

— Individual attribute visualization, which is used to visualize the data of an
individual attribute in time series. Color intensity and brightness are functions
of date. 

— Evolutionary attribute visualization, which is used to show the evolution of a 
specified attribute in the form of a curve with a background that is colored 
according to the attribute values in the time series. 

— Classification visualization, which uses a constraint slider to set the maximal 
number of classes to be classified; as long as the classifying attribute is 
specified, the objects will be classified and rendered in terms of the classifying 
attribute. 

— Attributes association visualization, which is used to show the association 
between two or three attributes. The associations are represented as a colored 
tube that is drawn at the specified point for each tuple of attributes values 
and colored according to the R, G and B values based on attributes values. 

— Hierarchical concepts visualization, which is used to visualize hierarchical 
data. Sliders are used to control the levels of hierarchical data. 

These techniques are discussed in greater detail below and illustrated in 
Figure 3 and Figure 4. 

These methods have been used to visualize Canadian Statistics Data of Ed- 
ucation according to the analysis of various data for each province, e.g., the 
number and the amount of full time student loans, secondary school graduates, 
full time (part time) and male (female) enrollment in colleges or universities, 
university undergraduate or graduate degrees granted, etc. By visualizing these 
data, we can directly and straightforwardly characterize, summarize, associate, 
and classify different types of data, and predict the trend of some events in terms 
of past data. 

We first summarize the data visualization techniques, and introduce the 
structure of our system, DVIZ, in Section 2. The principle and rendering of 
each subsystem are discussed individually in Section 3. The application of DVIZ 
to Canadian Education Statistics is presented in Section 4. Finally, in Section 5, 
we give conclusions and discuss our future work. 

2 System Structure 

2.1 Basic Framework for Visualization

Among various data visualization techniques, there is a basic common framework,
which is an extension of Abowd and Beale’s model of human-computer
interaction [9]. This basic framework consists of four main components: User, 
Database, Visualization and Interaction, as shown in Figure 1. 

The User component describes the human user of the system, providing 
a description of his sophistication (e.g., expertise with the data visualized or 
with the system itself), the tasks which he wishes to perform (e.g., discovering 
associations between data, classifying data into categories, etc.), and his level of 
authority to view or modify data. 

Fig. 1. Basic Framework for Visualization



The Database component represents the actual database to be “mined” and
visualized, encompassing the data model used, the schema and the instances 
which populate the database. 

The Visualization component presents the data to the user by using the 
different predefined metaphors such as color, shape, shadow, texture, etc. to rep- 
resent different items of data and arranging the physical layout on the screen. 
Generally, this component contains a subcomponent that extracts relevant information
from the database, processes the extracted information and transfers
the results into metaphors. In our system, this subcomponent is the data mining 
process. 

The Interaction component addresses how the user may alter the presenta- 
tion of the data, or the data itself. This component interprets the user’s intention, 
the medium selected to realize the intention, and the effect of the action selected. 
Interaction effects may change the data, the visualization, or both. 

In our interpretation and implementation of this model, we have integrated 
the algorithms for mining data with the techniques used to visualize these results.
This seamless integration is most apparent in the interaction component
where the effect of an interaction (an action by the user, e.g., moving a slider or
picking an attribute) forces a recomputation of results for subsequent and simul- 
taneous display. 

Figure 1 shows this framework, where the interaction and the visualization 
components are combined together to form the user-friendly interface, which is 
usually implemented as an interactive visualization. 

This basic framework provides an integrated model, coordinating the sub- 
components. This point is particularly important to ensure that the resulting 
system is open and able to cooperate with other software systems. This frame- 
work facilitates incorporation of interactive components from various sources 
and provides for multiple visualization of the same data. 



2.2 System Architecture 

In our system, we use sliders and an attribute list as the interactive
method for setting visualization parameters, and color, curve and square as
visualization metaphors. The system consists of six subsystems, shown in Figure
2, each of which can be used to visualize different data characteristics and/or 
the relationships among data. To implement our prototype, we adopt differ- 
ent visualization techniques to design the different subsystems. These visualiza- 
tion techniques primarily include geometric techniques, graph-based techniques, 
three-dimensional techniques, dynamic techniques, and simplified pixel-oriented 
techniques. 




Fig. 2. System Architecture 



The characteristics of this system can be described as follows: 

— The basic technique used among all subsystems is a dynamic technique, 
which permits visualization of the data by the dynamic interaction. 

— All subsystems use one or two sliders as the interactive method to control 
the visualization of data. 

— All subsystems can visualize the data of more attributes (higher dimensional 
data), but some can do so simultaneously, and others only do so sequentially. 
By choosing different attribute(s) from the attribute list, the user can visualize
the data of the specified attribute(s).

— The method of rendering is based on the maximum and minimum data of 
given attribute(s). Generally speaking, the object with maximum data is 
rendered more red (greater R value and less G value of RGB color), the 
object with minimum data is rendered more green (less R value and greater 
G value), and the object with intermediate data is rendered more yellow. 

3 Algorithms of Visualization 

3.1 Individual Attribute Data Visualization 

When the attribute to be visualized is specified, as the date slider is moved, the 
color of each object area will be recalculated with respect to the attribute values 
in time series. The algorithm for RGB color calculation follows.

For specified attribute and date (time)
    Find the maximum and minimum values among all provinces
    For each province
        Find the value (val) of the specified attribute in the specified date
        Calculate the R, G, B corresponding to this value:
            r = val/max * 256
            g = 256 * (1 - (val-min)/(max-min))
            b = 0
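
As an illustration of the color rule above, the following minimal Python sketch maps one attribute value to an RGB triple. It is not the DVIZ code (which is written in C); the function names, the clamping of channels to 0..255, and the yellow fallback for a degenerate value range are all assumptions added for the example.

```python
def clamp_channel(x):
    """Clamp a raw channel value to the displayable range 0..255 (assumed)."""
    return max(0, min(255, int(x)))

def value_to_rgb(val, vmin, vmax):
    """Larger values become more red, smaller values more green; blue stays 0."""
    if vmax == vmin or vmax == 0:
        return (255, 255, 0)  # assumption: no spread in the data, render yellow
    r = val / vmax * 256
    g = 256 * (1 - (val - vmin) / (vmax - vmin))
    return (clamp_channel(r), clamp_channel(g), 0)

# Hypothetical attribute values for three areas on one date.
values = {"A": 10.0, "B": 55.0, "C": 100.0}
vmin, vmax = min(values.values()), max(values.values())
colors = {name: value_to_rgb(v, vmin, vmax) for name, v in values.items()}
```

In this example the area with the minimum value comes out almost pure green, the maximum pure red, and the intermediate value a yellowish mix, which matches the rendering behaviour described in Section 2.2.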



3.2 Evolutionary and Additional Attribute Visualization 

Evolutionary attribute visualization employs curves to visualize the data of a 
specified attribute during a period, which can be used to predict the trend of 
the attribute. 

The attribute to be visualized can be chosen from the attributes list. This 
subsystem will calculate the values of RGB color for each area based on the 
average values of the attribute, and draw a two-dimensional coordinate with a 
curve in each area to show the trend of the attribute for different years. In the 
coordinates shown, the horizontal axis and vertical axis are date and attribute 
values respectively. The curve is a polyline whose points (X, Y) are calculated
as follows:

For each area and each year
    X = (current_year - min_year) / (max_year - min_year) * X_len
    Y = (current_value - min_value) / (max_value - min_value) * Y_len
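
The point calculation can be sketched in Python as below. This is a hedged illustration rather than the original implementation: `polyline_points` is a hypothetical name, the minimum and maximum are taken per series (the paper does not say whether they are per area or global), and a zero span is replaced by 1 to avoid division by zero.

```python
def polyline_points(series, x_len, y_len):
    """Scale (year, value) pairs into a box of width x_len and height y_len."""
    years = [year for year, _ in series]
    vals = [value for _, value in series]
    min_year, max_year = min(years), max(years)
    min_val, max_val = min(vals), max(vals)
    year_span = (max_year - min_year) or 1  # assumption: guard against a single year
    val_span = (max_val - min_val) or 1     # assumption: guard against a flat series
    return [
        ((year - min_year) / year_span * x_len,
         (value - min_val) / val_span * y_len)
        for year, value in series
    ]

# Hypothetical attribute values for one area over the 1990 to 1995 period.
points = polyline_points([(1990, 2.0), (1991, 3.0), (1995, 4.0)], x_len=100, y_len=50)
```

The first and last years land on the left and right edges of the box, and the smallest and largest values on its bottom and top, so curves in different areas are drawn at the same scale on screen.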

The RGB color calculation of each area is computed by the following algo- 
rithm. 

For the specified attribute
    Find the maximum and minimum average values during the period
    For each area
        Find the value (val) of the specified attribute in the specified year
        Calculate the R, G, B corresponding to this area:
            r = val/max * 256
            g = 256 * (1 - (val-min)/(max-min))
            b = 0



Additional attributes evolutionary visualization employs the same method 
as above but for values of more than one attribute during a period, which can 
be used to predict the trends of attributes, and compare among the additional 
attributes. This method is different from evolutionary attribute visualization in 
that only one object area is visualized with respect to additional attributes at a 
time. 

3.3 Classification Visualization

Classification visualization is used to classify areas like provinces and territories 
based on the specified classifying attribute, and to visualize each class with dif- 
ferent colors that are calculated in terms of the values of the classifying attributes 
of those classes. 

This subsystem finds the corresponding values for all areas, and partitions 
the interval [minimum value, maximum value] into subintervals, the number of 
which is specified by the class number slider. Based on this partition, the sum of 
the classifying attribute values in the same class will be found and used as the 
classifying attribute value for this class. Finally, the classifying attribute values 
are used to calculate the colors of classes in the same way as that of individual 
attribute visualization. The rendering algorithm is as follows: 

Find the maximum and minimum values of the classifying attribute for all classes
For each class
    Find the sum of attribute values
    Calculate the R, G, B:
        r = sum/max * 256
        g = (1 - (sum-min)/(max-min)) * 256
        b = 0
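
A compact Python sketch of the partition-then-color procedure is given below. It assumes equal-width subintervals (the text only says the interval is partitioned), places the maximum value in the last class, and clamps channels to 255; the function names are hypothetical.

```python
def classify(values, n_classes):
    """Assign each area to one of n_classes equal-width subintervals of [min, max]."""
    vmin, vmax = min(values.values()), max(values.values())
    width = (vmax - vmin) / n_classes or 1.0  # assumption: equal widths
    return {
        area: min(int((v - vmin) / width), n_classes - 1)  # max value goes to last class
        for area, v in values.items()
    }

def class_sums(values, classes):
    """Sum the classifying attribute over the members of each class."""
    sums = {}
    for area, idx in classes.items():
        sums[idx] = sums.get(idx, 0.0) + values[area]
    return sums

def class_colors(sums):
    """Color each class from its summed value, as in individual attribute visualization."""
    smin, smax = min(sums.values()), max(sums.values())
    span = (smax - smin) or 1.0
    colors = {}
    for idx, s in sums.items():
        r = min(255, int(s / smax * 256)) if smax else 0
        g = min(255, int(256 * (1 - (s - smin) / span)))
        colors[idx] = (r, g, 0)
    return colors

# Hypothetical attribute values for four areas, split into two classes.
values = {"A": 0.0, "B": 10.0, "C": 90.0, "D": 100.0}
classes = classify(values, n_classes=2)
sums = class_sums(values, classes)
colors = class_colors(sums)
```

Moving the class number slider would correspond to calling `classify` again with a different `n_classes` and recomputing the sums and colors.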



3.4 Attributes Association Visualization 

Attributes association visualization is used to visualize the relationships among 
associated attributes [1, 5]. For two or three specified associated attributes, a 
cube will be drawn in the corresponding area for each year. All cubes have the 
same size, but each cube’s color is calculated based on the values of associated 
attributes in different years. For two associated attributes, only the R- and G-values
of the RGB color are calculated and the B-value defaults to 0. For three associated
attributes, R-, G-, and B-values will all be calculated, where each attribute
corresponds to one value of R, G, and B. Thus, this method can visualize five 
dimensional data. 

After two or three associated attributes are specified, the rendering algorithm 
works as follows: 

Find max and min of the first attribute, R_max and R_min
Find max and min of the second attribute, G_max and G_min
Find max and min of the third attribute, B_max and B_min
For each area (object) and each year
    Find the values R_val, G_val, and B_val of the three attributes
    Calculate color:
        r = (R_val-R_min) / (R_max-R_min) * 256
        g = (G_val-G_min) / (G_max-G_min) * 256
        b = (B_val-B_min) / (B_max-B_min) * 256
    Draw a cube at the origin with RGB
    Rotate around X and Z axes, respectively
    Translate to the corresponding position
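
The color part of this algorithm can be sketched as follows; the drawing, rotation and translation steps depend on the graphics library and are left out. The function name `association_color` and the clamping to 255 are assumptions for illustration.

```python
def association_color(vals, mins, maxs):
    """Map two or three associated attribute values to an (r, g, b) triple.

    Each attribute is normalized by its own min and max; with two attributes
    the blue channel defaults to 0.
    """
    channels = []
    for v, lo, hi in zip(vals, mins, maxs):
        span = (hi - lo) or 1.0  # assumption: guard against a constant attribute
        channels.append(min(255, int((v - lo) / span * 256)))
    while len(channels) < 3:
        channels.append(0)
    return tuple(channels)

# Hypothetical values: two associated attributes, then three.
two_attr = association_color((5.0, 20.0), mins=(0.0, 10.0), maxs=(10.0, 30.0))
three_attr = association_color((10.0, 10.0, 25.0),
                               mins=(0.0, 10.0, 20.0),
                               maxs=(10.0, 30.0, 30.0))
```

Because each attribute drives exactly one channel, areas and years in which the attributes rise and fall together show up as similar hues.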



3.5 Hierarchical Concepts Visualization 

Hierarchical concepts (in the sense of [3, 10]) visualization is used to visualize
the hierarchical data (concepts) of a specified attribute. When the highest level
concept is chosen, all objects will be rendered in the same color. When the lowest
level concept is chosen, all objects will be rendered in the same way as in
individual attribute visualization. When an attribute to be visualized is chosen
and the level of concepts is specified by moving sliders, the system will find the
maximum and minimum attribute values for the concepts in the current level, 
calculate the color, and then draw the objects that the concepts cover. The 
following is the rendering algorithm. 

Find max and min attribute values among all current concepts
For each current concept
    Find its attribute value
    Calculate the color
        r = val/max * 256
        g = (1 - (val-min)/(max-min)) * 256
        b = 0
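
One way to obtain the "current concepts" and their attribute values is sketched below. The hierarchy follows the geographic example used later in the paper, but the numeric values are invented, and summing child values to get a concept's value is an assumption (the paper does not state how a concept's attribute value is derived).

```python
# Geographic hierarchy from the paper's example; third-level concepts are treated as leaves.
HIERARCHY = {
    "Canada": ["Eastern Canada", "Western Canada"],
    "Eastern Canada": ["Atlantic Region", "Central Canada"],
    "Western Canada": ["The Prairies", "British Columbia", "The North"],
}

def concepts_at_level(root, level, hierarchy):
    """Return the concepts `level` steps down from the root (level 1 is the root)."""
    current = [root]
    for _ in range(level - 1):
        current = [child for c in current for child in hierarchy.get(c, [c])]
    return current

def concept_value(concept, hierarchy, leaf_values):
    """Assumed aggregation: a concept's value is the sum of its descendants' values."""
    children = hierarchy.get(concept)
    if not children:
        return leaf_values.get(concept, 0.0)
    return sum(concept_value(ch, hierarchy, leaf_values) for ch in children)

# Invented attribute values for the third-level concepts.
leaf_values = {"Atlantic Region": 10.0, "Central Canada": 50.0,
               "The Prairies": 20.0, "British Columbia": 15.0, "The North": 5.0}
level2 = concepts_at_level("Canada", 2, HIERARCHY)
level2_values = {c: concept_value(c, HIERARCHY, leaf_values) for c in level2}
```

The values in `level2_values` would then be passed through the same min/max color rule as in the pseudocode, and every object covered by a concept drawn in that concept's color.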



4 DVIZ Implementation and Application to the Canada 
Education Statistics 

We have used DVIZ to visualize a data set excerpted from “The 1997
year book” [11] published by the Canadian government. We chose 21 statistics
indices from among them. The collection of data to be visualized is from the
time period 1990 to 1995. If no statistics exist for some province in some
year, we use zero to represent these statistics and render them in
black (that is, RGB color = (0, 0, 0)). Due to space limitations, we simply
introduce two subsystems of our DVIZ implementation, evolutionary attributes 
and hierarchical concepts. Other attributes which our implementation considers 
from the Canada Education Statistics data set include individual attributes, 
classification attributes, association attributes and hierarchical attributes. 

In Figure 3 we show the application of DVIZ in visualizing the trend of at- 
tribute Per Capital Health Expenditure. Each province area is rendered according 
to the average attribute value during the period, and its corresponding curve is 
drawn in that area. 

In Figure 4 we visualize the third level concepts in a hierarchy of geographic 
concepts. We divide the geography of Canada into hierarchical concepts. The 
highest level concept is the entire country, Canada, which consists of two second
level concepts, Eastern Canada and Western Canada. Eastern Canada is composed
of Atlantic Region and Central Canada, and Western Canada is composed
of The Prairies, British Columbia, and The North. At the lowest level are the
individual provinces and territories. In Figure 4, the attribute Secondary School
Graduates in 1995 is chosen.

Fig. 3. Evolutionary Attribute-Per Capital Health Expenditure

DVIZ is implemented under X windows on an SGI IRIS workstation with 
Motif and GL (Graphics Library), where Motif is used to control the interac- 
tive interface, and GL is used to draw layout on the draw window. The source 
program consists of about 12600 lines of C code. 

5 Conclusions and Future Work 

We briefly introduced the system structure of knowledge visualization in data 
mining that we implemented and its application to the visualization of the Cana- 
dian Education Statistics. We discussed the basic ideas and algorithms of six 
visualization methods developed for our system. In these visualization methods, 
we utilized several important visualization techniques, including geometric pro- 
jection techniques [7], dynamic techniques [8], hierarchical techniques, etc. These 
visualization methods cover almost all data mining rules which may be discovered 
so that our methods should prove useful to visualize the results and the processes 
of data mining. In our implementation, we emphasize the interaction between
the computer and the human, since we believe that interactive visualization [4]
plays the most important role in data mining by interactively guiding the process
of discovering knowledge. Our system has demonstrated that it is useful for users,
by visualizing a large data set, to understand the relationships among data and
to concentrate on the meaningful data to discover knowledge.

Fig. 4. The concepts in level 3 visualization in 1995

Our next task is to improve this system and we will focus on the represen- 
tation of geometries and the combination of visualization techniques and data 
mining algorithms. 

The capability of this system will be expanded to visualize not only the static 
results of data mining but also the dynamic processes of discovering knowledge 
in large data sets. Combining the visualization and data mining algorithms will
produce a much more efficient method of knowledge discovery. Our initial
idea is first to use visualization techniques to limit the domain of data by
interacting with the user and then to mine the data to discover rules, and finally to
visualize the resulting knowledge. 

The geometries representing different knowledge will be developed to ensure 
the graphics have clearer meaning and are more easily understood. The corre- 
spondence between the knowledge and these geometries should be direct and 
straightforward. There is a contradiction between the straightforwardness and
the expressiveness of the geometries, however, so how to make the trade-off between
them is being considered. The arrow icon or fluid flow can be used to
represent the evolutionary rule, the association rule can be visualized as an animation
in which the decision attribute geometry varies in shape and/or color
with the condition attribute geometries, the classification rule can be represented
as ranges of three-dimensional space, and so on.

We realize that what we have presented is an initial step toward the specification
of an integrated data mining and visualization system and we expect
significant further developments exploiting these humble beginnings shortly. 

Abstract

The minimal-model semantics of causation is an essential concept for the identification of
a best-fitting model, in the sense of being satisfactorily consistent with the given data while
being the simpler, less expressive model. Developing an algorithm able to derive a
minimal model is therefore an interesting topic in the area of causal model discovery. The various
causal induction algorithms and tools developed so far cannot guarantee that the derived model
is a minimal model. This paper proves that the MML induction approach introduced by Wallace
et al. is a minimal causal model learner. The experimental results obtained from tests
on a number of both artificial and real models, provided in this paper, confirm this theoretical
result.



Keywords: Minimal model, causal discovery, data mining, machine learning, knowledge
acquisition, artificial intelligence.



1 Introduction 

Given a data set D with m instances {f, 1 ≤ i ≤ m} and n variables {x_j, 1 ≤ j ≤ n}, the
central task of causal discovery is to find a causal model M which provides a satisfactory 
explanation to the data D. In the model space, there are a great number of possible models 
which may fit the data. If a causal variable is represented by a from node and an effect 
variable is represented by a to node, then a causal model can be represented by a directed 
acyclic graph. So the number of causal models for a given data set of n variables is the 
number of directed graphs with n nodes. It is known that the number of directed graphs with
n nodes is pl82]. To search for a right model within this large number of possible 



N. Zhong and L. Zhou (Eds.): PAKDD'99, LNAI 1574, pp. 400-409, 1999. 
© Springer-Verlag Berlin Heidelberg 1999




A Minimal Causal Model Learner 401 



models is extremely difficult. Naturally, we could consider the problem in two major aspects. 
The first one is how to find a causal model from the model space which explains the data
well. It is obvious that there are many such models. The second is: among all
the possible models which provide a satisfactory explanation of the given data set, which is
the best model and how can we find it? A widely adopted general strategy is to choose the
simpler, less expressive model. The reason might be that scientists prefer simpler models.
This idea exactly fits the MML principle of choosing a model with less descriptive
complexity.

In this paper, Section 2 gives a preliminary description of the basic concepts, including
definitions. In Section 3, we briefly overview the MML induction approach. In Section 4, we
introduce a theorem which proves that a causal model derived from a given data set is a
minimal model. Section 5 provides four groups of experimental results which support the claim
made in Section 4. The final section gives a conclusion.



2 Preliminary 

We view a causal model as a representation of a regularity (or a set of regularities) which explains the
given data. Among all the possible models there are some relations and special features 
which are interesting and are important to the implementation of our central task of causal 
discovery. 

Equivalent Model. The basic relation between two models is the equivalence relation. Given
a model space Ω and two models M1, M2 ∈ Ω, M1 and M2 are said to be equivalent if they
represent the same set of probability distributions, denoted as M1 = M2. In other words,
two causal models are equivalent if the set of probability distributions, corresponding to the 
causal structure, that can be represented using one of the DAGs is identical to the set of 
probability distributions that can be represented using the other. 

Lemma 2.1 Equivalent Model 1. Let Ω be a model space. Two causal models M1, M2 ∈ Ω
are equivalent iff M1 ⪯ M2 and M1 ⪰ M2.

Lemma 2.2 Equivalent Model 2. Two causal models are equivalent if they define the
same probability distributions.

Lemma 2.3 Equivalent Model 3. Two DAGs are equivalent iff they have the same
skeleton and the same V-structures.
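
Lemma 2.3 gives a purely graphical test for equivalence, which can be sketched in Python as below. The representation of a DAG as a list of (cause, effect) pairs and the function names are assumptions for illustration; the sketch also assumes both graphs are defined over the same set of variables.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the graph: the set of adjacent node pairs."""
    return {frozenset(edge) for edge in edges}

def v_structures(edges):
    """All triples (a, c, b) with a -> c <- b where a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def equivalent(edges1, edges2):
    """Lemma 2.3: same skeleton and same V-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

chain = [("x", "y"), ("y", "z")]          # x -> y -> z
reverse_chain = [("z", "y"), ("y", "x")]  # x <- y <- z
collider = [("x", "y"), ("z", "y")]       # x -> y <- z
```

Here the two chains are equivalent, since they share a skeleton and neither contains a V-structure, while the collider is not equivalent to either chain even though all three graphs have the same skeleton.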

Equivalent Class. A set of models A is said to be an equivalent class of a model M, if the 
set is composed of and only of the models which are equivalent to M. 

Definition 2.1 Equivalent Class. Let m0 be a causal model and M a set of causal models.
If ∀m ∈ M, m -< M, and ^{m M) A (m/ ^ M), then we call M the equivalent class of
the causal model m0, denoted as [m0].

Definition 2.2 Consistency. Let D be a data set, P a distribution over D, and M a
structure induced from D. We say the structure is consistent with the data D if D can
accommodate M, by which D can be generated with the distribution P.

Minimal Model. Scientists prefer simpler models because such models are more constrained, 
thus more falsifiable; they provide scientists less opportunity to overfit the data hindsightedly 
and, therefore attain greater credibility [4, 5]. Following Judea Pearl’s explanation, a minimal 
model can be defined as below. 

Definition 2.3 Minimal Model. Let D be a given data set and M a model derived from
D by a causal model learner A. M is said to be a minimal model iff M ∈ [M*], where M*
is the simplest model consistent with D, and [M*] is the equivalent class of the model M*.

In other words, a causal model M discovered from a given data set D is said to be a minimal
model iff we cannot find a simpler, less expressive model equally consistent with the data.



3 MML Induction 

Suppose we are given a data set D of m samples, where each sample holds one value for each
of the n variables, i.e.,

sample 1: $x_{11}, x_{12}, \ldots, x_{1n}$

sample 2: $x_{21}, x_{22}, \ldots, x_{2n}$

...

sample m: $x_{m1}, x_{m2}, \ldots, x_{mn}$

Our task is to induce the causal model with the highest posterior probability from this given
data set.

In [7], Chris Wallace and his colleagues give a causal model discovery approach based on the
Minimum Message Length (MML) principle. The basic idea of the MML induction approach is to
measure a causal model by its message length and to choose the model with the minimum message
length. The algorithm can be described as follows.

Algorithm 3.1 MML-Induction.

Step 1: Describe the causal model. Give a general description of a causal model which
fits the given data set.

Step 2: Encode the causal model. Give a general formula for the cost of encoding the
causal model.

Step 3: Minimize the cost. Choose the unknown parameters of the general formula
so that the cost of encoding the causal model is minimized.

Step 4: Search for the causal model. Search the model space so that the selected model
has the minimum message length.

Encoding causal models. The total message length for encoding the causal model is

$$L = -\ln P(M) - \ln P(D|M) = L(M) + L(D|M) \qquad (1)$$

where $L(M) = -\ln P(M)$ is the cost of encoding the causal model M, and
$L(D|M) = -\ln P(D|M)$ is the cost of encoding the sample data D given the model M.
With natural logarithms the cost is measured in nits; dividing by $\ln 2$ converts it to a
number of bits.



$$L(M) = L^{(s)} + L^{(p)} \qquad (2)$$

$$L^{(s)} = \ln n! + \frac{n(n-1)}{2} - \ln \mu \qquad (3)$$

where $\mu$ is the number of total orderings consistent with the DAG, and

$$L^{(p)} = \frac{K}{2}\ln 2\pi + K\ln\alpha + \sum_{k=1}^{K}\frac{a_k^2}{2\alpha^2\sigma^2} + \frac{1}{2}\ln(2N) + \frac{1}{2}\ln|A| \qquad (5)$$

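And the message length for encoding the data given the model is given by

$$L(D|M) = -\ln P(\text{data} \mid \text{parameters}) \qquad (6)$$

$$= -\ln P(y \mid \sigma, \{a_j\}) \qquad (7)$$

$$= -\ln \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-r_i^2/(2\sigma^2)} \qquad (8)$$

$$= \frac{N}{2}\ln 2\pi + N\ln\sigma + \frac{1}{2\sigma^2}\sum_{i=1}^{N} r_i^2 \qquad (9)$$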



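The combined message length for encoding the parameters and the data given the model is

$$L^{(p)} + L(D|H) = \frac{N+K}{2}\ln 2\pi + K\ln\alpha + N\ln\sigma + \frac{1}{2\sigma^2}\Big(\sum_{i=1}^{N} r_i^2 + \sum_{j=1}^{K}\frac{a_j^2}{\alpha^2}\Big) \qquad (10)$$

$$\qquad + \frac{1}{2}\ln(2N) + \frac{1}{2}\ln|A| \qquad (11)$$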



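Minimizing the Costs. To choose the unknown parameters $\sigma$ and $a_j$ ($1 \le j \le K$) so that the
combined message length is minimized, we examine the partial derivatives with respect to
the unknown parameters $\sigma$ and the $a_j$ ($1 \le j \le K$):

$$\frac{\partial (L^{(p)} + L(D|H))}{\partial \sigma} = \frac{N}{\sigma} - \frac{1}{\sigma^3}\Big(\sum_{i=1}^{N} r_i^2 + \sum_{j=1}^{K}\frac{a_j^2}{\alpha^2}\Big) \qquad (12)$$

And for j = 1, 2, ..., K,

$$\frac{\partial (L^{(p)} + L(D|H))}{\partial a_j} = \frac{1}{\sigma^2}\Big(\frac{a_j}{\alpha^2} - \sum_{i=1}^{N} r_i x_{ij}\Big) \qquad (13)$$

Let

$$\frac{\partial (L^{(p)} + L(D|H))}{\partial \sigma} = 0 \qquad (14)$$

$$\frac{\partial (L^{(p)} + L(D|H))}{\partial a_j} = 0 \qquad (15)$$

Therefore,

$$\sigma^2 = \frac{1}{N}\Big(\sum_{i=1}^{N} r_i^2 + \sum_{j=1}^{K}\frac{a_j^2}{\alpha^2}\Big) \qquad (16)$$

Letting $\alpha = 1$, we have

$$(A + I)\,a = b \qquad (17)$$

where

$$a = (a_1, a_2, \ldots, a_K)^T \qquad (18)$$

$$b = (y \cdot x_1,\; y \cdot x_2,\; \ldots,\; y \cdot x_K)^T \qquad (19)$$

and $x_j = (x_{1j}, x_{2j}, \ldots, x_{Nj})$ for j = 1, 2, ..., K and $y = (y_1, y_2, \ldots, y_N)$. A is the K x K
square matrix $A = (x_i \cdot x_j)_{K \times K}$ and I is the K x K unit matrix. So

$$a = (A + I)^{-1} b \qquad (20)$$

Hence, the final solution is

$$\sigma^2 = \frac{1}{N}\Big(\sum_{i=1}^{N} r_i^2 + \sum_{j=1}^{K} a_j^2\Big), \qquad a = (A + I)^{-1} b \qquad (21)$$

where $\alpha = 1$ and $r_i = y_i - \sum_{k=1}^{K} a_k x_{ik}$. Adding these values to $L^{(s)}$ produces the total MML
cost for a given model relative to the input data.

The $\{a_j\ (1 \le j \le K)\}$ are the path coefficients of the derived causal model.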
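The closed-form solution can be illustrated with a minimal sketch. It assumes $\alpha = 1$, builds the matrix of inner products $x_i \cdot x_j$, solves $(A + I)a = b$, and then evaluates $\sigma^2$. The function names, the small Gauss-Jordan solver and the toy data are illustrative assumptions and are not part of the MML-CI system.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))


def solve(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mml_path_coefficients(columns, y):
    """columns[j] holds the N observations of x_j; y holds the dependent variable."""
    k = len(columns)
    # (A + I)[j][l] = x_j . x_l, plus 1 on the diagonal (prior term with alpha = 1)
    a_plus_i = [[dot(columns[j], columns[l]) + (1.0 if j == l else 0.0)
                 for l in range(k)] for j in range(k)]
    b = [dot(y, columns[j]) for j in range(k)]
    a = solve(a_plus_i, b)
    n = len(y)
    residuals = [y[i] - sum(a[j] * columns[j][i] for j in range(k)) for i in range(n)]
    sigma2 = (dot(residuals, residuals) + dot(a, a)) / n
    return a, sigma2


# Toy data (hypothetical): y = 2*x1 - x2 with no noise.
x1 = [1.0, -1.0, 2.0, -2.0]
x2 = [1.0, 1.0, -1.0, -1.0]
y = [2.0 * p - 1.0 * q for p, q in zip(x1, x2)]
coeffs, sigma2 = mml_path_coefficients([x1, x2], y)
```

On this toy data the coefficients come out as 20/11 and -4/5 rather than 2 and -1: the identity added to A shrinks the estimates toward zero, which is the effect of the prior on the path coefficients.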



4 Minimal Model Theorem 

From the derivation of the MML-CI algorithm, we obtain the following theorem. The
theorem states that the model derived by the MML-CI algorithm is a simpler, less expressive
model among all the models which are consistent with the given data set.

Theorem 4.1 Minimal Model Theorem.

Let D be a training data set with sufficiently large sample size. A causal model discovered via
the MML principle from the given data set D is a minimal model.

Proof:

The precondition of a training data set with sufficiently large sample size ensures that the
data set provides enough information for a learner to find the right model from it. It has no
bearing on whether MML induction is a minimal model learner, so the following proof does
not deal with this condition.

Let D be a given data set with m samples, each sample holding n attribute values, and let M
be the model derived from D via MML. To simplify the problem we assume that the model
contains one dependent variable only, i.e.,

$$y_i = \sum_{j=1}^{n} a_j x_{ij} + r_i \qquad (22)$$

with i = 1, 2, ..., m, where the variable means are assumed to be zero, $r_i \sim N(0, \sigma^2)$, $x_{ij}$ is
the ith observation of the independent variable $X_j$, and $\sigma$ and $\{a_j \mid 1 \le j \le n\}$ are unknown.
In the notation of Section 3, N = m and K = n.

The second part of the message length, L(D|M), the message length for encoding the data
D given the model M, is given by

$$L(D|H) = -\ln P(y \mid \sigma, \{a_j\}) \qquad (23)$$

$$= -\ln \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-r_i^2/(2\sigma^2)} \qquad (24)$$

$$= \frac{N}{2}\ln 2\pi + N\ln\sigma + \frac{1}{2\sigma^2}\sum_{i=1}^{N} r_i^2 \qquad (25)$$

The unknown parameters are selected so that

$$\frac{\partial (L^{(p)} + L(D|H))}{\partial a_j} \qquad (26)$$

$$= \frac{1}{\sigma^2}\Big(\frac{a_j}{\alpha^2} - \sum_{i=1}^{N} r_i x_{ij}\Big) = 0 \qquad (27)$$

This optimization procedure ensures that the derived model M is consistent with the data D.

Secondly, in MML induction the model is selected from the model space using the criterion

$$\min_{M \in \Omega}\,[L(M) + L(D|M)] \qquad (28)$$

This means that the length of M is no greater than the length of M',
where M' is any model which is consistent with the given data.

This proves that the model M derived by MML induction is a simpler, less expressive model.

Therefore the model discovered via MML induction from a given data set D is a minimal
model. □

Is MML induction equivalent to Glymour's approach based on constraints on correlations
and to Pearl and Verma's approach based on conditional independence? Can MML induction
do what the other two approaches do?

What is the advantage of MML induction? What are the differences and similarities among
MML induction, Glymour's approach based on constraints on correlations, and Pearl and
Verma's approach based on conditional independence?



5 Experimental results 

Three groups of experiments have been conducted to show that the causal models derived
from a given data set by MML induction are (1) consistent with the given data set and
(2) the simpler, less expressive ones. Each of the first two experiments uses a group of
artificial models. Experiment 3 uses real models from the literature.

Experiment 1:

In the first experiment, we took the four artificial models shown in Figure 1 (a)(b)(c)(d),
generated a sample data set with 5000 samples for each of the four models, and then derived
a model from each of the generated data sets. The induced models are shown in Figure 1
(a')(b')(c') and (d'). From the results, we can easily see that the derived models have
exactly the same structure as the original models. The models (a'), (b'), (c') and (d') are
also consistent with (a), (b), (c) and (d) respectively, since the data sets were generated
from the original models. That is to say, the derived models are consistent with the data
sets generated from (a), (b), (c) and (d) respectively. It can also be verified that we cannot
find a simpler, less expressive model than (a'), (b'), (c') or (d') which does not belong to
[(a')], [(b')], [(c')] or [(d')]. These results confirm that the derived models are minimal
models.



Figure 1: MML-CI Test Results on Artificial Models 1 (panels: Model 1, Model 2, Model 3, Model 4)

Experiment 2:

In experiment 2, we again use four artificial models, shown in Figure 2 (a)(b)(c)(d).
This time, we generate a data set with 1000 samples for each of models (1) and (2), and a
data set with 5000 samples for each of models (3) and (4), because models (1) and (2) are
less complex in terms of the number of nodes and the number of links. As we can see from
the derived models illustrated in Figure 2 (a')(b')(c') and (d'), the models (a'), (b') and
(c') have exactly the same structure as the original models (a), (b) and (c) from which the
data sets were generated. The derived model (d') has exactly the same structure as the
original model (d) except that the orientation of the arc between the variables A and C is
reversed. However, the model (d') is statistically equivalent to (d), i.e., (d') $\in$ [(d)].

Experiment 3: 



Figure 3 illustrates the test results of the MML-CI system on four real models. Model (a)
in Figure 3 is Blau's stratification process model of occupation [2]. Model (b) is Retherford
and Choe's model of the fertility of Fijian women. Model (c) is Goldberg's mediation model
[1, p. 43] of vote performances. Model (d) is the model of Verbal and Mechanical Ability
described in [3, pp. 162-185]. In Figure 3, the models (a'), (b'), (c') and (d') are the
derived models. The derived models (c') and (d') have exactly the same structures and the
same orientations as the original models (c) and (d) respectively. Model (a') has exactly the
same structure as its original model (a) except for one link between D and F, whose
orientation is reversed. It is obvious that model (a) is equivalent to model (a'), so (a') is
consistent with the data generated from model (a). Interestingly, the derived model (b') has
a simpler structure than the original model (b): it dropped two weak links, and it has a
smaller message length because it is simpler. The expected log likelihoods of model (a) and
model (a') are almost the same; model (a) is consistent with the generated data, and so is
model (a').

Figure 2: MML-CI Test Results on Artificial Models 2



6 Conclusions 

A minimal model learner is an ideal approach for the discovery of causal models from a given
data set. This paper proves that the MML induction introduced by Chris Wallace and his
colleagues is a minimal model learner, provided that a data set with sufficiently large sample
size is available. The experimental results confirm this theoretical result: the model derived
by MML induction is the simpler, less expressive model which is consistent with the given
data. This result is significant because it indicates that MML induction is a minimal model
learner which is able to derive the best model in terms of descriptive complexity and
accuracy.



Abstract. In this paper, we study the issues of mining and maintaining
association rules in a large database of customer transactions. The problem of 
mining association rules can be mapped into the problems of finding large 
itemsets which are sets of items bought together in a sufficient number of 
transactions. We revise a graph-based algorithm to further speed up the process 
of itemset generation. In addition, we extend our revised algorithm to maintain 
discovered association rules when incremental or decremental updates are made 
to the databases. Experimental results show the efficiency of our algorithms. 
The revised algorithm significantly improves over the original one on mining 
association rules. The algorithms for maintaining association rules are more 
efficient than re-running the mining algorithms for the whole updated database 
and outperform previously proposed algorithms that need multiple passes over 
the database. 



1 Introduction 

Data Mining has been considered as a new area in database research [5]. When a 
huge amount of data has been collected in large databases, it is quite important to 
extract potentially useful knowledge embedded in it. To extract such previously 
unknown knowledge from large databases is the task of data mining. Various types of 
knowledge can be mined from large databases, such as mining characteristic and 
classification rules [8,9], association rules [1,2,10], and sequential patterns [6,10]. 

Data mining has been widely applied in retail industry to improve market 
strategies. Each customer transaction stored in the database typically consists of 
customer identifier, transaction time, and the set of items bought in this transaction. It 
is important to analyze the customer transactions to discover customer purchasing 
behaviors. The problem of mining association rules over customer transactions was 
introduced in [1]. When new transactions are added to a database or old ones are
removed, rules or patterns previously discovered should be updated. The problem of
maintaining discovered association rules was first studied in [3], which proposed the
FUP algorithm to discover new association rules when incremental updates are made
to the database. The algorithm proposed in [4] improves FUP by generating and
counting fewer candidates.

The graph-based algorithm DLG proposed in [10] can efficiently solve the
problem of mining association rules. In [10], DLG is shown to outperform other
algorithms which need to make multiple passes over the database. In this paper, we
first propose the revised algorithm DLG* to achieve higher performance. Then we
develop two algorithms, DIUP (DLG* for Incremental Updates) and DDUP (DLG*
for Decremental Updates), which are based on the framework of DLG*, to handle the
problem of maintaining discovered association rules in the cases of insertion and
deletion of transaction data.

This paper is organized as follows. Section 2 gives detailed descriptions of the
above two problems. The algorithm DLG* is described in Section 3. The algorithms
DIUP and DDUP for maintaining association rules are described in Section 4.
Experimental results are discussed in Section 5, and Section 6 concludes our study.



2 Problem Descriptions 



2.1 Mining association rules 

The following definitions refer to [2]. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called
items. A set of items is called an itemset. The number of items in an itemset is called
the length of the itemset. Itemsets of length k are referred to as k-itemsets. Let D be a
database of transactions. A transaction T contains itemset X if and only if $X \subseteq T$. An
association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and
$X \cap Y = \emptyset$. The support count of itemset X, $sup_X$, is the number of transactions in D
containing X. The association rule $X \Rightarrow Y$ has support s% if s% of transactions in D
contain $X \cup Y$, i.e. $sup_{X \cup Y} / |D| = s\%$. The association rule $X \Rightarrow Y$ has confidence c%
if c% of transactions in D that contain X also contain Y, i.e. $sup_{X \cup Y} / sup_X = c\%$.
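The two measures can be computed directly from their definitions. The following is a minimal sketch on a small hypothetical database; the function names and the four transactions are assumptions made for illustration only.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(transactions, x, y):
    """Fraction of transactions containing X union Y."""
    return support_count(transactions, set(x) | set(y)) / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    return support_count(transactions, set(x) | set(y)) / support_count(transactions, x)


# Hypothetical database of four transactions.
D = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 4}]
rule_support = support(D, {1}, {2})        # 3 of 4 transactions contain {1, 2}
rule_confidence = confidence(D, {1}, {2})  # every transaction with item 1 also has item 2
```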

The problem of mining association rules is to generate all rules that have support
and confidence greater than the user-specified thresholds, minimum support and
minimum confidence. As mentioned before, the problem of mining association rules
can be divided into the following steps:

1. Find all frequent itemsets that have support above the user-specified minimum
support. Each such itemset is referred to as a large itemset. The set of all large
itemsets in D is L, and $L_k$ is the set of large k-itemsets.

2. Generate the association rules from the large itemsets with respect to another
threshold, minimum confidence.

The second step is relatively straightforward. However, the first step is not trivial if
the total number of items |I| and the maximum number of items in each transaction
|MT| are large. For example, if |I| = 1000 and |MT| = 10, there are $2^{1000}$ possible itemsets.
We need to identify large itemsets among $\sum_{i=1}^{10} \binom{1000}{i} \approx 2.66 \times 10^{23}$ potentially large
itemsets. Therefore, finding all large itemsets satisfying a minimum support is the
main problem of mining association rules.
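The size of this search space is easy to check numerically. The short sketch below sums the binomial coefficients for itemsets of length 1 to 10 over 1000 items; the variable names are illustrative.

```python
from math import comb

n_items = 1000   # |I|
max_len = 10     # |MT|, the longest itemset a single transaction can contain

# Itemsets of length 1..10 that could appear in some transaction.
candidates = sum(comb(n_items, i) for i in range(1, max_len + 1))

# All non-empty subsets of the 1000 items, for comparison.
all_itemsets = 2 ** n_items - 1
```

The sum is dominated by its last term, the number of 10-itemsets, and is on the order of $10^{23}$, far below $2^{1000}$ but still far too large to enumerate.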


2.2 Update of association rules

The following definitions refer to [4]. Let L be the set of large itemsets in the
original database D, and s% be the minimum support. Assume the support count of
each large itemset X, $sup_X$, is available. Let $d^+$ ($d^-$) be the set of added (deleted)
transactions, and let the support count of itemset X in $d^+$ ($d^-$) be denoted $sup^+_X$ ($sup^-_X$).
With respect to the same minimum support s%, an itemset X is large in the updated
database D' if and only if the support count of X in D', $sup'_X$, is not less than
$|D'| \times s\%$, i.e. $(sup_X - sup^-_X + sup^+_X) \ge (|D| - |d^-| + |d^+|) \times s\%$.

Thus the problem of updating association rules is to find the set of new large
itemsets L' in D'. Note that a large itemset in L may not appear in L'. On the other
hand, an itemset not in L may become a large itemset in L'.
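The update condition can be written as a one-line test. The sketch below is an illustration under assumed names (the function and its parameters are not from the paper); the first call uses the figures of the insertion example in Section 4, where |D| = 7, two transactions are inserted and the minimum support is 25%, and the second call uses hypothetical counts.

```python
def is_large_after_update(sup_x, sup_del, sup_ins, n_d, n_del, n_ins, min_support):
    """True if X is large in the updated database D' = D - d_minus + d_plus."""
    new_count = sup_x - sup_del + sup_ins
    new_size = n_d - n_del + n_ins
    return new_count >= new_size * min_support


# Itemset {1,2}: support count 2 in D, 0 in the inserted part, threshold 9 * 0.25 = 2.25.
large_12 = is_large_after_update(2, 0, 0, 7, 0, 2, 0.25)

# Hypothetical itemset with support count 2 in D and 1 in the inserted part.
large_other = is_large_after_update(2, 0, 1, 7, 0, 2, 0.25)
```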



3 The Revised Algorithm DLG* 



DLG is a three-phase algorithm. The large 1-itemset generation phase finds large
items and records related information. The graph construction phase constructs an
association graph between large items, and at the same time generates large
2-itemsets. The large itemset generation phase generates large k-itemsets (k>2) based
on this association graph. The DLG* algorithm reduces the execution time during the
large itemset generation phase by recording additional information in the graph
construction phase.



3.1 Large 1-itemset generation phase 

The DLG algorithm scans the database to count the support and builds a bit vector
for each item. The length of a bit vector is the number of transactions in the database.
The bit vector associated with item i is denoted as $BV_i$. The jth bit of $BV_i$ is set to 1 if
item i appears in the jth transaction. Otherwise, the jth bit of $BV_i$ is set to 0. The
number of 1's in $BV_i$ is equal to the support count of the item i.

For example, Table 1 records a database of transactions, where TID is the
transaction identifier and Itemset records the items purchased in the transaction.
Assume the minimum support count is 2 transactions ($|D| \times$ minimum support). The
large items and the associated bit vectors are shown in Table 2.



Table 1. A database of transactions

TID    Itemset
100    2 3 4 5 6
200    1 3 7
300    3 6
400    1 2 3 4
500    1 4
600    1 2 3
700    2 4 5

Table 2. The bit vectors of large items in Table 1

Large item    Bit vector
1             0101110
2             1001011
3             1111010
4             1001101
5             1000001
6             1010000
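The large 1-itemset generation phase can be sketched in a few lines. The code assumes that the transactions of Table 1 are TID 100 = {2,3,4,5,6}, 200 = {1,3,7}, 300 = {3,6}, 400 = {1,2,3,4}, 500 = {1,4}, 600 = {1,2,3} and 700 = {2,4,5}; the names `TRANSACTIONS` and `build_bit_vectors` are illustrative, and bit vectors are kept as strings for readability rather than speed.

```python
# Transactions of Table 1, keyed by TID (an assumed reading of the table).
TRANSACTIONS = {
    100: {2, 3, 4, 5, 6},
    200: {1, 3, 7},
    300: {3, 6},
    400: {1, 2, 3, 4},
    500: {1, 4},
    600: {1, 2, 3},
    700: {2, 4, 5},
}
MIN_SUPPORT_COUNT = 2


def build_bit_vectors(transactions):
    """Return {item: bit string}, with one character per transaction in TID order."""
    tids = sorted(transactions)
    items = sorted(set().union(*transactions.values()))
    return {i: "".join("1" if i in transactions[t] else "0" for t in tids)
            for i in items}


bit_vectors = build_bit_vectors(TRANSACTIONS)

# An item is large when its bit vector holds at least MIN_SUPPORT_COUNT ones.
large_items = {i: bv for i, bv in bit_vectors.items()
               if bv.count("1") >= MIN_SUPPORT_COUNT}
```

Item 7 occurs in a single transaction and is therefore dropped, leaving the six large items listed in Table 2.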


3.2 Graph construction phase

The support count for the itemset $\{i_1, i_2, \ldots, i_k\}$ is the number of 1's in
$BV_{i_1} \wedge BV_{i_2} \wedge \ldots \wedge BV_{i_k}$, where the notation "$\wedge$" is a logical AND operation. Hence,
the support count of the itemset $\{i_1, i_2, \ldots, i_k\}$ can be found directly by applying logical
AND operations on the bit vectors of the k items instead of scanning the database.
If the number of 1's in $BV_i \wedge BV_j$ (i<j) is not less than the minimum support count, a
directed edge from item i to item j is constructed in the association graph. Also, {i,j}
is a large 2-itemset. Taking the database in Table 1 as an example, the association graph is
shown in Figure 1, and $L_2$ = {{1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {2,5}, {3,4}, {3,6}, {4,5}}.

Fig. 1. The association graph for Table 1

Table 3. The related information recorded by DLG* for Figure 1
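The graph construction phase amounts to one AND and one bit count per pair of large items. The sketch below packs each bit vector into a Python integer; the transactions are an assumed reading of Table 1 in TID order, and the function names are illustrative.

```python
from itertools import combinations

# Transactions of Table 1 in TID order (100 to 700), as assumed here.
ROWS = [{2, 3, 4, 5, 6}, {1, 3, 7}, {3, 6}, {1, 2, 3, 4}, {1, 4}, {1, 2, 3}, {2, 4, 5}]
MIN_COUNT = 2


def bit_vector(item, rows):
    """Pack the occurrences of `item` into an int, first transaction in the highest bit."""
    bits = 0
    for row in rows:
        bits = (bits << 1) | int(item in row)
    return bits


def ones(bits):
    """Number of 1's in a bit vector."""
    return bin(bits).count("1")


def association_graph(rows, min_count):
    """Return the directed edges (i, j), i < j, i.e. the large 2-itemsets."""
    items = sorted(set().union(*rows))
    bv = {i: bit_vector(i, rows) for i in items}
    large = [i for i in items if ones(bv[i]) >= min_count]
    return [(i, j) for i, j in combinations(large, 2)
            if ones(bv[i] & bv[j]) >= min_count]


edges = association_graph(ROWS, MIN_COUNT)
```

The resulting edge list coincides with the nine large 2-itemsets given in the text.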



3.3 Large itemset generation phase 

For each large k-itemset $\{i_1, i_2, \ldots, i_k\}$ in $L_k$ (k>1), the last item $i_k$ is used to extend the
itemset into (k+1)-itemsets. If there is a directed edge from item $i_k$ to item j, the itemset
$\{i_1, i_2, \ldots, i_k, j\}$ is a candidate (k+1)-itemset. If the number of 1's in
$BV_{i_1} \wedge BV_{i_2} \wedge \ldots \wedge BV_{i_k} \wedge BV_j$ is not less than the minimum support count,
$\{i_1, i_2, \ldots, i_k, j\}$ is a large (k+1)-itemset in $L_{k+1}$. If no large k-itemset is generated in
the k-th iteration, the algorithm terminates.

Consider the above example. Candidate large 3-itemsets can be generated based on
$L_2$. The candidate 3-itemsets are {{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,6}, {1,4,5},
{2,3,4}, {2,3,6}, {2,4,5}, {3,4,5}}. After applying the logical AND operations on the
bit vectors of the three items in each candidate, the set of large 3-itemsets $L_3$ = {{1,2,3},
{2,3,4}, {2,4,5}} is generated. The candidate 4-itemsets are {{1,2,3,4}, {1,2,3,6},
{2,3,4,5}}. After applying the logical AND operations on the bit vectors of the four
items in each candidate, no large 4-itemset is generated. Thus, the algorithm
terminates.
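The three phases can be put together in a compact sketch of DLG. It is an illustration under assumptions, not the authors' implementation: the transactions are an assumed reading of Table 1, the function name `dlg` and its return values are hypothetical, and bit vectors are Python integers.

```python
# Transactions of Table 1 in TID order (100 to 700), as assumed here.
ROWS = [{2, 3, 4, 5, 6}, {1, 3, 7}, {3, 6}, {1, 2, 3, 4}, {1, 4}, {1, 2, 3}, {2, 4, 5}]


def ones(bits):
    """Number of 1's in a bit vector stored as an int."""
    return bin(bits).count("1")


def dlg(rows, min_count):
    """Return (large itemsets per length k >= 2, number of candidates per length)."""
    # Phase 1: bit vectors of the large items.
    bv = {}
    for i in sorted(set().union(*rows)):
        bits = 0
        for row in rows:
            bits = (bits << 1) | int(i in row)
        if ones(bits) >= min_count:
            bv[i] = bits
    # Phase 2: association graph; out[i] lists the items j > i with {i, j} large.
    out = {i: [j for j in bv if j > i and ones(bv[i] & bv[j]) >= min_count]
           for i in bv}
    level = {(i, j): bv[i] & bv[j] for i in bv for j in out[i]}
    # Phase 3: extend each large k-itemset along the edges leaving its last item.
    large, n_candidates, k = {}, {}, 2
    while level:
        large[k] = sorted(level)
        candidates = {s + (j,): vec & bv[j]
                      for s, vec in level.items() for j in out[s[-1]]}
        k += 1
        n_candidates[k] = len(candidates)
        level = {s: vec for s, vec in candidates.items() if ones(vec) >= min_count}
    return large, n_candidates


large, n_candidates = dlg(ROWS, 2)
```

Each candidate reuses the bit vector of the itemset it extends, so only one additional AND is needed per candidate in this sketch; the text counts k-1 ANDs per candidate k-itemset because it recomputes the conjunction from the item vectors.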



3.4 Improvements over DLG 

In the k-th (k>2) iteration, DLG generates candidate k-itemsets by extending each
large (k-1)-itemset according to the association graph. Suppose that, on average, the
out-degree of each node in the association graph is q. The number of candidate
itemsets is $|L_{k-1}| \times q$, and DLG must perform $|L_{k-1}| \times q \times (k-1)$ logical AND operations
on bit vectors to determine all large k-itemsets. The key issue of the DLG* algorithm
is to reduce the number of candidate itemsets. The following properties are used by
DLG* to reduce the number of candidates.

Lemma 1 If a (k+1)-itemset $X = \{i_1, i_2, \ldots, i_{k+1}\} \in L_{k+1}$ ($k \ge 2$), then $\{i_j, i_{k+1}\} \in L_2$ for
$1 \le j \le k$, that is, item $i_{k+1}$ is contained in at least k large 2-itemsets.

In the large itemset generation phase, DLG* extends each large k-itemset in $L_k$
($k \ge 2$) into (k+1)-itemsets like the original DLG algorithm. Suppose $\{i_1, i_2, \ldots, i_k\}$ is a
large k-itemset, and there is a directed edge from item $i_k$ to item i. From Lemma 1, if
the (k+1)-itemset $\{i_1, i_2, \ldots, i_k, i\}$ is large, it must satisfy the following two conditions
(otherwise, it cannot be large and is excluded from the set of candidate
(k+1)-itemsets).

1. Item i must be contained in at least k large 2-itemsets. In other words, the total
number of in-degrees and out-degrees of the node associated with item i must be at
least k.

2. Any $\{i_j, i\}$ ($1 \le j \le k$) must be large. A directed edge from $i_k$ to item i means that
$\{i_k, i\}$ is already a large 2-itemset. Therefore, we only need to check if all $\{i_j, i\}$
($1 \le j \le k-1$) are large.

These simple checks significantly reduce the number of candidate itemsets. In
order to speed up these checks, we record some information during the graph
construction phase. For the first condition, for each large item, we count the number
of large 2-itemsets containing this item. For the second condition, a bitmap with
$|L_1| \times |L_1|$ bits is built to record related information about the association graph. If there
is a directed edge from item i to item j, the bit associated with {i,j} is set to 1.
Otherwise, the bit is set to 0. DLG* requires extra memory space of size quadratic in
|I|, but improves the performance significantly.

Let's illustrate the DLG* algorithm with the example in Table 1. The extra
information recorded by DLG* is shown in Table 3. The large 2-itemset {1,2} can
be extended into {1,2,3}, {1,2,4}, and {1,2,5}. Consider {1,2,3}: item 3 is contained
in 4 ($\ge 2$) large 2-itemsets, and the bit associated with {1,3} in the bitmap is 1.
Therefore, {1,2,3} is a candidate. Consider {1,2,5}: item 5 is contained in 2 ($\ge 2$)
large 2-itemsets, but the bit associated with {1,5} in the bitmap is 0. Therefore,
{1,2,5} is not a candidate. In the third iteration, DLG generates 10 candidates, but
DLG* generates only 5 candidates ({1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {2,4,5}). The
large 3-itemset {1,2,3} can be extended into {1,2,3,4} and {1,2,3,6}. Consider
{1,2,3,4}: item 4 is contained in 4 ($\ge 3$) large 2-itemsets, and the bits associated with
{1,4} and {2,4} in the bitmap are both 1. Therefore, {1,2,3,4} is a candidate.
Consider {1,2,3,6}: item 6 is contained in 1 (<3) large 2-itemsets. Therefore,
{1,2,3,6} is not a candidate. In the fourth iteration, DLG generates 3 candidates, but
DLG* generates only 1 candidate, {1,2,3,4}. In this example, DLG* has reduced the
number of candidate k-itemsets (k>2) by $\frac{13-6}{13} \times 100\% \approx 54\%$.
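The two pruning conditions can be checked on the example with a short sketch. It takes $L_2$ and $L_3$ as given in the text; a set of pairs stands in for the bitmap and a counter for the per-item degree. The function names are illustrative assumptions.

```python
from collections import Counter

# Large 2-itemsets and 3-itemsets of the running example, as listed in the text.
L2 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (3, 4), (3, 6), (4, 5)]
L3 = [(1, 2, 3), (2, 3, 4), (2, 4, 5)]


def dlg_candidates(level, l2):
    """DLG: extend every itemset along each edge leaving its last item."""
    out = {}
    for i, j in l2:
        out.setdefault(i, []).append(j)
    return [s + (j,) for s in level for j in out.get(s[-1], [])]


def dlg_star_candidates(level, l2):
    """DLG*: keep a DLG candidate only if it passes the two checks of Lemma 1."""
    pairs = set(l2)                                   # plays the role of the bitmap
    degree = Counter(i for pair in l2 for i in pair)  # large 2-itemsets per item
    kept = []
    for s in dlg_candidates(level, l2):
        prefix, new_item = s[:-1], s[-1]
        k = len(prefix)
        enough_pairs = degree[new_item] >= k                           # condition 1
        all_large = all((x, new_item) in pairs for x in prefix[:-1])   # condition 2
        if enough_pairs and all_large:
            kept.append(s)
    return kept


dlg_total = len(dlg_candidates(L2, L2)) + len(dlg_candidates(L3, L2))
star_total = len(dlg_star_candidates(L2, L2)) + len(dlg_star_candidates(L3, L2))
reduction = (dlg_total - star_total) / dlg_total
```

A hash set is used here instead of a bitmap only to keep the sketch short; the check it answers is the same.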





K.L. Lee et al. 



4 Efficient Update Algorithms based on DLG*

In this section, we introduce two update algorithms for transaction insertion and
deletion. The algorithms DIUP and DDUP are based on the framework of DLG*,
which can be split into three phases. As in [3], we assume the support counts of all
large itemsets found in the previous mining operations are available. If a candidate
itemset X is large in the original database, we can directly get the support count $sup_X$
in the original database D. Otherwise, we must apply logical AND operations on the
bit vectors associated with D to find the support count $sup_X$. However, we can use the
following properties to reduce the cost of performing logical AND operations. As
defined in Section 2, $sup^+_X$ is the support count of itemset X in the set of inserted
transactions $d^+$ and $sup^-_X$ is the support count in the set of deleted transactions $d^-$. The
following lemma is similar to Lemma 4 in [4].

Lemma 2 If an itemset X is not large in the original database, then X is large in
the updated database only if $sup^+_X - sup^-_X > (|d^+| - |d^-|) \times s\%$.

For each k-itemset X not in $L_k$, we first apply logical AND operations on the bit
vectors associated with the changed part of the database ($d^+$ and $d^-$). For the itemsets
X satisfying $sup^+_X - sup^-_X \le (|d^+| - |d^-|) \times s\%$, we can determine they will not be large in L'
without applying logical AND operations on the bit vectors associated with the
unchanged part of the database ($D - d^-$). In the following, we describe the DIUP and
DDUP algorithms for the cases of insertion ($|d^-| = 0$) and deletion ($|d^+| = 0$) in detail.
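Lemma 2 gives a cheap filter that looks only at the changed part of the database. A minimal sketch follows; the function name and argument order are assumptions, and the second call uses hypothetical counts.

```python
def may_become_large(sup_ins, sup_del, n_ins, n_del, min_support):
    """Necessary condition of Lemma 2 for an itemset that was not large in D."""
    return sup_ins - sup_del > (n_ins - n_del) * min_support


# Insertion of 2 transactions with a 25% minimum support: an itemset that never
# occurs in the inserted part cannot become large, so D need not be examined.
skip_case = may_become_large(sup_ins=0, sup_del=0, n_ins=2, n_del=0, min_support=0.25)

# Hypothetical itemset that occurs once in the inserted part: D must still be checked.
check_case = may_become_large(sup_ins=1, sup_del=0, n_ins=2, n_del=0, min_support=0.25)
```

Only itemsets for which the function returns True require logical AND operations on the bit vectors of the unchanged part of the database.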

4.1 Large 1-itemset generation phase 

The DIUP algorithm scans the set of inserted transactions $d^+$ and builds a bit vector
$BV^+_i$ for each item i. The number of 1's in $BV^+_i$ is equal to the support count of the
item i in $d^+$. In order to determine which items are large in the updated database D' more
efficiently, DIUP requires the support count of each item in the original database D to be
stored in the previous mining operation. Hence, we can directly calculate the new
support count $sup'_{\{i\}} = sup_{\{i\}} + sup^+_{\{i\}}$. If an item i is large in D', DIUP scans the original
database and builds a bit vector $BV_i$. After completing the phase, the bit vectors $BV_i$
and $BV^+_i$ for each large item i are available. This requires extra storage space of size
linear in |I| to store $sup_{\{i\}}$ for each item i, but reduces the cost of building bit vectors
and counting supports for those items not large in the updated database. For each
large item i, we allocate two bits $i.\Delta$ and $i.\delta$. We set the bit $i.\Delta$ ($i.\delta$) to 1 if item i is
large in D ($d^+$). Otherwise, the bit is set to 0. The usage of these two bits is explained
in the following two phases.

For example, consider the database in Table 4, which is updated from the original
database in Table 1. Assume the minimum support is 25%, that is, an itemset is in L'
if its support count is not less than 9 x 0.25 = 2.25. The bit vectors and two associated
bits ($\Delta$ and $\delta$) for each large item are shown in Table 5. The large items in the updated
database are items 1, 2, 3, 4, and 5. Since they are all large in $L_1$, their associated $\Delta$
bits are all set to 1. If the support count of any item is not less than $|d^+| \times 0.25 = 0.5$, this
item is large in $d^+$. Since $sup^+_{\{2\}} = sup^+_{\{4\}} = 1$ (>0.5), $sup^+_{\{5\}} = 2$ (>0.5), and
$sup^+_{\{1\}} = sup^+_{\{3\}} = 0$, we set $2.\delta = 4.\delta = 5.\delta = 1$ and $1.\delta = 3.\delta = 0$.

Table 4. An example of the insertion case

Table 5. The bit vectors and associated bits of large items in Table 4

4.2 Graph construction phase 

Each 2-itemset X = {i,j} (i<j), where i and j are large items in D', is a candidate
2-itemset. $sup^+_X$ can be found by counting the number of 1's in $BV^+_i \wedge BV^+_j$. For
$X \in L_2$, $sup_X$ is available from the previous mining result, so we can calculate
$sup'_X = sup_X + sup^+_X$. If $sup'_X$ is not less than $|D'| \times s\%$, then X is added into $L'_2$. For
$X \notin L_2$, according to Lemma 2, $X \in L'_2$ only if $sup^+_X > |d^+| \times s\%$. If $sup^+_X > |d^+| \times s\%$, we
perform $BV_i \wedge BV_j$ and count the number of 1's to find $sup_X$, then add this count to
$sup^+_X$ to get $sup'_X$. If $sup'_X$ is not less than $|D'| \times s\%$, X is added into $L'_2$.

The two bits $i.\Delta$ and $i.\delta$ of each large item i can be used to further improve the
performance. If any subset of an itemset X is not large, X cannot be large, either.
Before we check whether $\{i,j\} \in L_2$, we can check if both $i.\Delta$ and $j.\Delta$ are equal to 1. If
either $i.\Delta$ or $j.\Delta$ is 0, X cannot be in $L_2$. Thus, we save the cost of the
membership check, which is costly when $|L_2|$ is large. For $X \notin L_2$, if either $i.\delta$ or $j.\delta$ is
0, $sup^+_X$ cannot be greater than $|d^+| \times s\%$. Thus, we know $X \notin L'_2$ without performing
$BV^+_i \wedge BV^+_j$, which is costly when the number of transactions in $d^+$ is large.

For each large 2-itemset {i,j}, a directed edge from item i to item j is constructed in
the association graph. We also set two bits $X.\Delta$ and $X.\delta$ for each large 2-itemset X.

Continuing the above example, there are 10 candidate 2-itemsets. These candidates
are all in $L_2$, except {1,5} and {3,5}. Consider $\{1,2\} \in L_2$: $sup^+_{\{1,2\}}$ = number of 1's in
$BV^+_1 \wedge BV^+_2$ (i.e., (00)) = 0, so $sup'_{\{1,2\}}$ = 2+0 = 2 (<2.25). Therefore, $\{1,2\} \notin L'_2$.
Consider $\{1,5\} \notin L_2$: $\{1,5\} \notin L'_2$ since $1.\delta = 0$. After checking these 10 candidates,
$L'_2$ = {{1,3}, {2,3}, {2,4}, {2,5}, {4,5}}. The association graph is shown in Figure 2,
and Table 6 shows the associated bits of the large 2-itemsets.

Fig. 2. The association graph constructed by DIUP

Table 6. The associated bits of the large 2-itemsets


4.3 Large itemset generation phase

Candidate itemsets are extended from $L'_k$ in the way used in DLG*. Suppose
$X = \{i_1, i_2, \ldots, i_k\}$ is a large k-itemset, and a candidate $Y = \{i_1, i_2, \ldots, i_k, i_{k+1}\}$ can be
generated successfully based on X. Similar to the above phase, we get $sup^+_Y$ by performing
$BV^+_{i_1} \wedge BV^+_{i_2} \wedge \ldots \wedge BV^+_{i_{k+1}}$ and then check whether $Y \in L_{k+1}$. The bits $X.\Delta$ and $i_{k+1}.\Delta$ are
used to save the cost of the membership check. For $Y \notin L_{k+1}$, if either $X.\delta$ or $i_{k+1}.\delta$ is 0,
we know $Y \notin L'_{k+1}$ without performing $BV^+_{i_1} \wedge BV^+_{i_2} \wedge \ldots \wedge BV^+_{i_{k+1}}$. Two bits $X.\Delta$ and $X.\delta$
are set for each large itemset X, which can be used in the next iteration.

Continuing the above example, there is only one candidate 3-itemset, {2,4,5},
extended from the large 2-itemset {2,4}. Since $\{2,4,5\} \in L_3$, $sup'_{\{2,4,5\}} = sup_{\{2,4,5\}} +
sup^+_{\{2,4,5\}}$ = 2+0 = 2 (<2.25). Therefore, $\{2,4,5\} \notin L'_3$. Table 7 compares the number of
candidates of DLG* with that of DIUP. DLG* performs logical AND operations on
bit vectors associated with the whole updated database ($D \cup d^+$), with a total of 11
candidates. DIUP performs logical AND operations on bit vectors associated with $d^+$,
with a total of 9 candidates, and performs no operation on those associated with D.

In the k-th iteration, N x (k-1) logical AND operations are performed to determine
all large k-itemsets, where N is the number of candidates. While DLG* performs
10 x 1 + 1 x 2 = 12 operations on bit vectors of length 9, DIUP performs
8 x 1 + 1 x 2 = 10 operations on bit vectors of length 2. Therefore, DIUP reduces the
cost of performing logical AND operations by $1 - \frac{10 \times 2}{12 \times 9} \approx 81\%$.
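The cost comparison can be reproduced with a few lines of arithmetic. The sketch assumes, as the ratio in the text suggests, that the cost of one AND operation is proportional to the length of the bit vectors involved; the function name is illustrative.

```python
def and_cost(candidates_per_iteration, vector_length):
    """Return (number of AND operations, operations weighted by bit-vector length)."""
    ops = sum(n * (k - 1) for k, n in candidates_per_iteration.items())
    return ops, ops * vector_length


# Candidates per iteration k; DLG* works on D plus the inserted part (9 transactions),
# DIUP only on the 2 inserted transactions.
dlg_star_ops, dlg_star_bits = and_cost({2: 10, 3: 1}, vector_length=9)
diup_ops, diup_bits = and_cost({2: 8, 3: 1}, vector_length=2)

saving = 1 - diup_bits / dlg_star_bits
```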



Iteration    DLG* ($D \cup d^+$)    DIUP (D)    DIUP ($d^+$)
2            10                     0           8
3            1                      0           1

Table 7. Number of candidate itemsets



4.4 The deletion algorithm DDUP 

The DDUP algorithm is similar to the DIUP algorithm. As in DIUP, we allocate two
bits $X.\Delta$ and $X.\delta$ to indicate if the large itemset X is large in D and $d^-$. The bit $X.\Delta$ is
used to reduce the cost of the membership check. The usage of $X.\delta$ is different from
that in DIUP, i.e., we need not check whether X is large if $X.\delta$ is equal to 0. Due to
space limitation, the complete description of DDUP is omitted.



5 Experimental Results 

To assess the performance of our graph-based algorithms for discovering and 
maintaining large itemsets, the algorithms DLG [10], DLG*, FUPj [4], DIUP, and 
DDUP are implemented on a Sun SPARC/20 workstation. We first show the 
improvement of DLG* over DLG, and then demonstrate the performance of DIUP 
and DDUP by comparing with DLG* and FUPj, which is the most efficient update 
algorithm so far developed. 
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5.1 Synthetic data generation 

In each experiment, we use synthetic data as the input dataset to evaluate the
performance of the algorithms. The method used to generate synthetic transactions is the
same as the one used in [10]. The parameters are similar to those in [10] except for
the size of the changed part of the database ($d^+$ or $d^-$). The parameters used to generate
our synthetic database are listed in Table 8. Readers can refer to [10] for a detailed
explanation of these parameters.



|D|     Number of transactions
|d|     Number of inserted/deleted transactions
|I|     Average size of the potentially large itemsets
|MI|    Maximum size of potentially large itemsets
|L|     Number of the potentially large itemsets
|T|     Average size of the transactions
|MT|    Maximum size of the transactions
N       Number of items

Table 8. The parameters

We generate datasets by setting N=1000 and |L|=1000. We choose two values for
|T|, 10 and 20, with the corresponding |MT|=20 and 40, respectively. We choose two
values for |I|, 3 and 5, with the corresponding |MI|=5 and 10, respectively. We use
Tx.Iy.Dm.dn, adopted from [8], to mean that |T|=x, |I|=y, |D|=m thousand, and |d|=n
thousand. Notice that a positive value n means the size of the inserted transactions |d+|,
and a negative one means the size of the deleted transactions |d-|.
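
As a small illustration of this naming scheme, the following sketch decodes a setting string such as T10.I5.D100.d+1 into its parameters. The function name `parse_setting` and the keys of the returned dictionary are hypothetical; only the Tx.Iy.Dm.dn pattern itself comes from the text.

```python
import re


def parse_setting(name):
    """Decode a 'Tx.Iy.Dm.dn' setting; m and n are given in thousands."""
    match = re.fullmatch(r"T(\d+)\.I(\d+)\.D(\d+)\.d([+-]?\d+)", name)
    if match is None:
        raise ValueError(f"not a Tx.Iy.Dm.dn setting: {name!r}")
    t, i, d, delta = (int(group) for group in match.groups())
    return {
        "avg_transaction_size": t,          # |T|
        "avg_large_itemset_size": i,        # |I|
        "num_transactions": d * 1000,       # |D|
        "num_inserted": max(delta, 0) * 1000,   # |d+| when n is positive
        "num_deleted": max(-delta, 0) * 1000,   # |d-| when n is negative
    }


insertion_case = parse_setting("T10.I5.D100.d+1")
deletion_case = parse_setting("T10.I5.D100.d-1")
```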



5.2 Effects of the minimum support on the update algorithms 

The value of minimum support is varied from 0.5% to 2%. We use the setting
T10.I5.D100.d+1 for the insertion case, and T10.I5.D100.d-1 for the deletion case.

Figure 3 shows the experimental results in the insertion case. As shown, DIUP is
1.8 to 3.5 times faster than DLG*, and 1.9 to 3.8 times faster than FUP2. Figure 4
shows that in the deletion case, DDUP is 1.8 to 3.7 times faster than DLG*, and 2.7 to 6.6
times faster than FUP2.

The results show that DIUP and DDUP always perform better than
re-running DLG* and FUP2. The speed-up ratio over DLG* decreases as the minimum
support increases, because the number of large itemsets becomes smaller and
re-running DLG* is less costly. In general, the smaller the minimum support is, the
larger the speed-up ratio over FUP2 is, since FUP2 makes more database scans.
However, for minimum supports between 0.5% and 1.25% in Figure 3 and between
0.75% and 1.25% in Figure 4, the speed-up ratio becomes smaller. This is because in
these cases the number of candidate itemsets increases with a smaller minimum
support, but the number of database scans does not increase much for FUP2.



5.3 Effects of the update size on the update algorithms 

Next, we examine how the size of the changed part of the database affects the
performance of the algorithms. When the amount of changes becomes large, the
performance of update algorithms degrades. There are two major reasons for this. First,
the previous mining results become less useful when the updated database is very
different from the original one. Second, the number of transactions that need to be
handled increases. Two series of experiments, shown in Figure 5 and Figure 6, were
conducted to support this analysis. In the insertion case, we increase the number of
inserted transactions from 20k to 120k to evaluate the performance ratio. The results
show that DIUP is 3.4 to 3.7 times faster than FUP2, and 1.4 to 1.9 times faster than
DLG*. Although the execution time of DLG* also increases as |d+| does, the speed-up
ratio decreases. However, DIUP still performs better even when |d+|=120k,
which is larger than the size of the original database.



6 Conclusion and Future Work 



We study efficient graph-based algorithms for discovering and maintaining
knowledge in the database. The revised algorithm DLG* is developed to efficiently
solve the problem of mining association rules. DLG* improves DLG [10] by reducing
the number of candidates and the cost of performing logical AND operations on bit
vectors. Two update algorithms, DIUP and DDUP, which are based on the framework
of DLG*, are further developed to solve the problem of maintaining association rules
in the cases of insertion and deletion. The experimental results show that both of them
significantly outperform FUP2 [4], which is the most efficient update algorithm
developed so far. DIUP always performs faster than re-running DLG* even when |d+|
is larger than the size of the original database |D|, and DDUP keeps a better
performance when |d-| is not greater than 30% of |D|.

Currently, we are also developing graph-based algorithms for mining and 
maintaining different kinds of knowledge, such as sequential patterns and generalized 
association rules. 



Fig. 3. Effects of the minimum support in the deletion case (T10.I5.D100.d-1;
x-axis: minimum support (%))

Fig. 4. Effects of the number of inserted transactions (x-axis: number of
inserted transactions)

Fig. 5. Effects of the number of deleted transactions (T10.I5.D100.d-x;
x-axis: number of deleted transactions, 10k to 60k)

Fig. 6. The comparison of DDUP with DLG* (T10.I5.D100.d-x; x-axis: number of
deleted transactions, 10k to 60k)
Abstract. The Basket Analysis derives frequent itemsets and association
rules having support and confidence levels greater than their thresholds
from massive transaction data. Though some recent research tries
to discover wider classes of knowledge on the regularities contained in
the data, regularities in the form of graph structures have not been
explored in the field of the Basket Analysis. The work reported in this paper
proposes a new method to mine frequent graph structures appearing in
a massive amount of transactions. A specific procedure to preprocess
graph structured transactions is introduced to enable the application of
the Basket Analysis to extract frequently appearing graph patterns. The
basic performance of our proposed approach has been evaluated on a
set of graph structured transactions generated by an artificial simulation.
Moreover, its practicality has been confirmed through its application to
the discovery of popular browsing patterns of clients in a WWW URL network.



1 Introduction 

The Basket Analysis derives frequent itemsets and association rules having sup- 
port and confidence levels greater than their thresholds from massive transaction 
data [1],[2]. Some recent research of the Basket Analysis tries to discover wider 
classes of knowledge on the regularities contained in the data. One represen- 
tative work is to introduce taxonomy of items and Boolean constraints among 
items under the taxonomy [3] . The association rules among items satisfying the 
specified constraints are efficiently derived from massive data in their approach. 
Another extension of the Basket Analysis on the class of the knowledge discovery 
is to derive association rules among continuously ordered items, i.e., sequential 
item patterns [4]. The taxonomy and Boolean constraints are among the most
commonly used constraints in various data analyses. Sequential data of items
are also frequently observed in practical applications.

Another familiar structure and constraint of data which has not been
explored in the field of the Basket Analysis is the graph structure, i.e., constraints
on the uni-directional and/or bi-directional relations among nodes. Data
having a graph structure are widely seen in various problem domains such as
the network flow phenomena in the information stream of the internet, the car traffic
stream in urban areas, the parallel process streams in computer operating systems,
the structure of URLs and their links in the WWW service, and causality
among physical states. The discovery of frequently observed graph structures from
a set of given data has been researched in the machine learning area, and the
most representative approach is called “GBI (Graph Based Induction)” [5], [6].
Given a set of transactions where each transaction represents a graph consisting
of some nodes and links, GBI searches for typical graph structures observed more
often than a threshold frequency in the transaction data. Another version of the GBI
program can discover some specific graph structures which characterize the features
of nodes and/or links contained in the graph. Though GBI provides a powerful
means to figure out important graph structures from a set of given data, its
basic algorithm requires a thorough search in the data to find links contained
in the objective structures. Accordingly, the state of the art in mining graph
structures is not satisfactory for really massive data.

The work reported in this paper proposes a new method to mine frequent 
graph structures appearing in the massive transactions. A specific procedure to 
preprocess graph structured transactions is introduced to enable the application 
of the Basket Analysis to extract frequently appearing graph patterns. 

2 Transaction Having Graph Structure 

The transaction analyzed by the conventional Basket Analysis is a set of items. 
For example, the transactions for customers buying items in a grocery store are 
represented as follows. 

customer_1 : {milk, bread, butter, ...},

⋮

customer_n : {milk, bread, apple, ...}.

On the other hand, a transaction of graph structure contains nodes and links
as depicted in Figure 1. Given massive graph structured transactions, if a
subgraph pattern {A → B → C → A} appears more often than a certain frequency
level, then this subgraph can be called a “frequent subgraph”, similar to
the “frequent itemset”. Furthermore, if the transactions containing the subgraph
{A → B} also contain {A → B → C → A} in more than a certain fraction
of the transactions, then an “association rule among subgraph structures”
{A → B} ⇒ {A → B → C → A} is mined from the transactions. Though
the basic content of this problem is similar to the data mining of association
rules among items, the conventional Basket Analysis can only discover the frequent
itemsets and the association rules among the itemsets, but not those of
graph structures. To apply the Basket Analysis to graph structured data, the
data representation of each graph structured transaction is transformed into the
form of an itemset transaction in our approach. Thus, the derivation of association
rules among graph structured transactions is enabled by this transformation
within the framework of the conventional Basket Analysis.

Fig. 1. Transactions containing graph structures.



The basic principle of the transformation of the graph structured transactions
into the itemset transactions is as follows. Given a set of all nodes
$V_{all} = \{v_1, v_2, \ldots, v_p\}$ and a set of all links among them
$L_{all} = \{v_i \to v_j \mid v_i, v_j \in V_{all}\}$,
a transaction $T_k$ having a graph structure is represented as a subset
of $L_{all}$, i.e., $T_k \subseteq L_{all}$. As a transaction to be analyzed in the conventional
Basket Analysis is a subset of all items in the data excluding the null set, the
transactions having graph structures can also be analyzed in the same framework
by handling each link $v_i \to v_j$ in $T_k$ as an item.

For example, the two graph structured transactions depicted in Figure 1 can
be represented as the itemset transactions

{A → B, B → C, C → A, A → D}, (1)

{A → B, B → C, C → A, C → E}. (2)

By regarding each different link as a corresponding different item, the stan- 
dard Basket Analysis is applicable to mine frequent subgraph structures and the 
association rules among the structures. 
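
A minimal sketch of this idea in Python follows. Each directed link becomes one string item, so an ordinary subset test checks whether a subgraph occurs in a transaction. The helper name `graph_to_itemset` and the "A->B" item encoding are assumptions made for illustration.

```python
def graph_to_itemset(links):
    """Turn an iterable of (source, target) links into a set of link items."""
    return frozenset(f"{source}->{target}" for source, target in links)


# The two transactions of Figure 1, i.e. expressions (1) and (2).
t1 = graph_to_itemset([("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")])
t2 = graph_to_itemset([("A", "B"), ("B", "C"), ("C", "A"), ("C", "E")])

# The cycle A -> B -> C -> A is a subgraph of both transactions.
cycle = graph_to_itemset([("A", "B"), ("B", "C"), ("C", "A")])
occurrences = sum(1 for t in (t1, t2) if cycle <= t)
```

With the two example transactions, `t1 & t2` is exactly the cycle, and `occurrences` is 2.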



3 Implementation of Basket Analysis 

3.1 Preprocessing Graph Structured Transactions 

The transactions containing graph structured data are given in various forms 
in practical fields. For example, the data of the network flow phenomena are 
usually represented as a collection of the series of nodes and links where the 
objects such as information packets and cars go through. In case of parallel 
process streams in computer operating systems, the streams of the processes 
and the data exchanged among the processes are represented in form of the list 
of the names of the processes and data together with the list of the pointers 
connecting these processes and the data. However, any graph data contained in
a transaction can generally be transformed without much computational effort
into the form of an adjacency matrix, which is a very well known representation
of a graph in mathematical graph theory [7]. Each row and each column of the
matrix corresponds to a node that appears in the graph, and if a link
from the i-th node to the j-th node appears in the transaction, the value “1”
is assigned to the ij-element of the matrix; otherwise the value “0” is assigned.
For the first example in Figure 1, the adjacency matrix is as follows. The
second is represented similarly.



      A  B  C  D
  A [ 0  1  0  1 ]
  B [ 0  0  1  0 ]
  C [ 1  0  0  0 ]
  D [ 0  0  0  0 ]

for the transaction (1)



Once the adjacency matrix of each transaction is derived, the transaction in the form
of an itemset representing the graph structure is obtained by choosing each pair
of nodes (i, j) having the value “1” in the ij-element and adding the arrow
from the node i to the node j in each pair. When the transaction data contain
only non-directed graph structures, where the direction of each link between any
two nodes is not specified, the adjacency matrix of each transaction becomes
diagonally symmetric. In this case, the bar “-” is added between the nodes i
and j, and i - j and j - i become an identical link.

After the data transformation of the transactions has been conducted, all links
appearing in all transactions are sorted and numbered in a lexicographical order
for efficient item processing, as in the conventional Basket Analysis.
In the example of Figure 1, all links are numbered as follows.

A → B = 1, B → C = 2, C → A = 3, A → D = 4, C → E = 5.

Then, the expression (1) is rewritten as

{1, 2, 3, 4}. (3)

The frequent subgraphs and the association rules among the subgraph structures
directly obtained by the Basket Analysis are also represented by the number
labels of the links. For the comprehensibility of the results, their representations
are transformed back to the original links at the final stage of the analysis.
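
The preprocessing steps of this subsection can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, and sorting links as (source, target) tuples is one possible interpretation of "lexicographical order", so the label given to an individual link need not coincide with the listing in the text.

```python
def links_from_adjacency(nodes, matrix):
    """Collect every link (nodes[i], nodes[j]) whose ij-element equals 1."""
    return {
        (nodes[i], nodes[j])
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value == 1
    }


def number_links(transactions):
    """Assign 1, 2, 3, ... to all distinct links in sorted order."""
    all_links = sorted(set().union(*transactions))
    return {link: label for label, link in enumerate(all_links, start=1)}


def encode(transaction, labels):
    """Replace each link by its number label."""
    return sorted(labels[link] for link in transaction)


def decode(itemset, labels):
    """Map number labels back to the original links (final stage)."""
    inverse = {label: link for link, label in labels.items()}
    return [inverse[label] for label in itemset]


# Adjacency matrix of transaction (1), and the analogous one for (2).
t1 = links_from_adjacency("ABCD", [[0, 1, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]])
t2 = links_from_adjacency("ABCE", [[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 1], [0, 0, 0, 0]])

labels = number_links([t1, t2])
encoded_t1 = encode(t1, labels)
encoded_t2 = encode(t2, labels)
```

Under these assumptions transaction (1) is again encoded as the label set {1, 2, 3, 4}, and `decode` recovers the original links.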



3.2 Deriving Representative Association Rules 

The standard Basket Analysis derives all frequent itemsets and all association
rules having support and confidence levels greater than their thresholds, and
filters out trivial rules by applying statistical heuristics [1],[2]. However, some
studies have pointed out that this standard approach cannot get rid of redundant
rules and also misses some essential rules because of the incompleteness
of the statistical heuristics used for the rule filtering [8], [9]. To alleviate this
difficulty, the authors have proposed a complete logical rule filter which can retain
only “representative association rules” [8]. The identical idea has also been
provided by another researcher [9]. The representative association rule has the
characteristic of deriving maximal consequences from minimal facts while maintaining
their support and confidence greater than or equal to the given threshold
levels. We apply this principle to derive “association rules among subgraph
structures”.

The principle to derive the representative association rules is briefly explained
in this subsection. An association rule has the following form, where “Body”: B
stands for an itemset and “Head”: H for another itemset (a superset of B).¹

B ⇒ H, where B ⊆ H.

The “support” values of B and H, i.e., sup(B) and sup(H), are the ratios of the
number of transactions including each set to the total number of transactions,
respectively. The “confidence” value, conf(B ⇒ H), stands for the credibility
of the rule, and is defined as the ratio of the number of transactions including H
to the number of transactions including B. The Basket Analysis generates all
frequent itemsets having a support value greater than a threshold l-sup, and
derives all association rules whose head is a frequent itemset and whose confidence
value is greater than another threshold value s-conf.
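
These two definitions translate directly into code. The sketch below is illustrative: the first two transactions are expressions (1) and (2) of Figure 1, the last two are made up for the example, and the function names are not from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that include every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(body, head, transactions):
    """conf(B => H) = (#transactions including H) / (#transactions including B).

    Assumes the body occurs in at least one transaction.
    """
    return support(head, transactions) / support(body, transactions)


transactions = [
    frozenset({"A->B", "B->C", "C->A", "A->D"}),  # transaction (1)
    frozenset({"A->B", "B->C", "C->A", "C->E"}),  # transaction (2)
    frozenset({"A->B", "A->D"}),                  # hypothetical
    frozenset({"C->E"}),                          # hypothetical
]

body = frozenset({"A->B"})
head = frozenset({"A->B", "B->C", "C->A"})

sup_body = support(body, transactions)      # 3 of 4 transactions
sup_head = support(head, transactions)      # 2 of 4 transactions
conf_rule = confidence(body, head, transactions)
```

Here the rule {A → B} ⇒ {A → B → C → A} has support 0.5 and confidence 2/3.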

The association rules derived by these procedures have many redundancies
that derive identical consequences from identical given facts. These redundancies
reduce the comprehensibility of the regularities discovered in the data and the
efficiency of the use of those rules for some specific purposes. Instead of using
statistical heuristics to remove the redundancies, we apply the following criteria.

Support threshold: The head of every association rule must have a support
greater than a threshold “lowest support”: l-sup.

Uniform confidence: Every association rule must have a confidence close to
but not less than a level “specified confidence”: s-conf.

Maximal consequence: Every association rule must derive a maximally
specific consequence from a minimal fact.

The following definitions are introduced to implement these criteria.

Minimal bodyset: For a specified confidence s-conf, if a rule B ⇒ H satisfies
the following condition, B is said to be a “minimal bodyset” of Head: H
under s-conf.

$conf(B \Rightarrow H) \geq s\text{-}conf$ and $conf(B' \Rightarrow H) < s\text{-}conf, \ \forall B' \subset B$

Maximal headset: For a specified confidence s-conf, if a rule B ⇒ H
satisfies the following condition, H is said to be a “maximal headset” of
Body: B under s-conf.

$conf(B \Rightarrow H) \geq s\text{-}conf$ and $conf(B \Rightarrow H') < s\text{-}conf, \ \forall H' \supset H$

Representative association rule: For a specified confidence s-conf and
a lowest support l-sup, if Body: B and Head: H of a rule B ⇒ H are
the minimal bodyset and the maximal headset respectively, B ⇒ H is said
to be a “representative association rule”. This rule satisfies the
aforementioned three criteria.

¹ This representation of association rules is different from the standard notion B ⇒
R where $R = H - B$. We use H instead of R for ease of explanation.
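
The definitions can be made concrete with a brute-force sketch. It is not the rule generator used in the paper: it enumerates all itemsets, so it is only suitable for toy data, and it makes three assumptions that the text leaves open, namely that bodies are non-empty, that thresholds are compared with "greater than or equal", and that only frequent supersets are considered when testing headset maximality.

```python
from itertools import combinations


def count(itemset, transactions):
    """Number of transactions that include every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def conf(body, head, transactions):
    return count(head, transactions) / count(body, transactions)


def proper_nonempty_subsets(itemset):
    items = sorted(itemset)
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            yield frozenset(combo)


def representative_rules(transactions, l_sup, s_conf):
    n = len(transactions)
    universe = sorted(set().union(*transactions))
    # Candidate heads: itemsets of size >= 2 that occur and meet l-sup.
    heads = []
    for size in range(2, len(universe) + 1):
        for combo in combinations(universe, size):
            c = count(frozenset(combo), transactions)
            if c > 0 and c / n >= l_sup:
                heads.append(frozenset(combo))
    rules = set()
    for head in heads:
        for body in proper_nonempty_subsets(head):
            if conf(body, head, transactions) < s_conf:
                continue
            # Minimal bodyset: no smaller body reaches s-conf for this head.
            minimal = all(conf(b, head, transactions) < s_conf
                          for b in proper_nonempty_subsets(body))
            # Maximal headset: no larger frequent head reaches s-conf.
            maximal = all(conf(body, h, transactions) < s_conf
                          for h in heads if head < h)
            if minimal and maximal:
                rules.add((body, head))
    return rules


transactions = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("a")]
rules = representative_rules(transactions, l_sup=0.25, s_conf=0.6)
```

For this toy data the sketch keeps three rules: {a} ⇒ {a, b}, {b} ⇒ {a, b, c} and {c} ⇒ {a, b, c}. Rules such as {b} ⇒ {a, b} are dropped because the same body already supports the larger head {a, b, c}.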

The representative association rules still contain some redundancy in terms of
the inference ability. We apply a logical “rule-filter” where the rule AB ⇒ ABR
is removed when two maximal estimation rules

AB ⇒ ABR and B ⇒ BCR

are obtained. Here, every intersection among A, B, C, R is empty, and AB =
A + B, ABR = A + B + R and BCR = B + C + R. This rule-filtering does not
violate the aforementioned criteria.

4 Performance Evaluation 

4.1 Validation through Simulation Data 

The basic performance of our proposed method to discover frequent graph structures
and the association rules among the structures has been validated on
graph structured transaction data having clear characteristics. The data
have been artificially generated through a Monte-Carlo simulation on the path
array shown in Figure 2. This is a 4 × 4 link array, and vehicles start from the
node 1 and arrive at the node 16 by following directed (one way) links. Each vehicle
chooses one of the links to proceed with an equal probability (50% each)
at every binary branch of the links. A run of a vehicle from the start node 1 to
the final goal node 16 corresponds to a transaction consisting of the intermediate
links along the path that the vehicle passes through. The total number of
links between adjacent nodes in this path array is 24, and the number
of items (links) involved in each transaction is 6. Furthermore, the total
number of path sequences from the node 1 to the node 16 is 20, and the total number
of possible frequent itemsets in this example is theoretically known to be
847. The probability of each path that vehicles go through can also be theoretically
evaluated. In total, 10000 transactions, i.e., 10000 simulated histories
of vehicle operations, are generated to ensure sufficient statistical accuracy of
the validation. Multiple combinations of support and confidence thresholds of
the Basket Analysis are applied in the validation analysis. The Apriori
algorithm [1] has been used to derive frequent subgraphs. The effect of the support
threshold l-sup has been assessed in the range of [0%, 35%], and the
confidence threshold s-conf has been varied in the range of [30%, 90%].
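
The data generation can be sketched as below. The node layout is an assumption inferred from the example paths in the text (nodes numbered row by row, links pointing to the right neighbour n + 1 and to the lower neighbour n + 4); the random seed and all names are illustrative.

```python
import random

SIZE = 4  # 4 x 4 path array, nodes numbered 1..16 row by row (assumed layout)


def out_links(node):
    """Nodes reachable from `node` by one directed link (right or down)."""
    row, col = divmod(node - 1, SIZE)
    targets = []
    if col < SIZE - 1:
        targets.append(node + 1)      # link to the right neighbour
    if row < SIZE - 1:
        targets.append(node + SIZE)   # link to the lower neighbour
    return targets


def simulate_run(rng):
    """One vehicle run from node 1 to node 16, returned as a set of links."""
    node, links = 1, []
    while node != SIZE * SIZE:
        nxt = rng.choice(out_links(node))  # 50% each at a binary branch
        links.append((node, nxt))
        node = nxt
    return frozenset(links)


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
transactions = [simulate_run(rng) for _ in range(10000)]

all_links = [(n, m) for n in range(1, SIZE * SIZE + 1) for m in out_links(n)]
observed_support = sum(
    1 for t in transactions if {(1, 2), (2, 3)} <= t
) / len(transactions)
```

Under this layout the array has 24 links, every transaction contains 6 of them, and `observed_support` for {1 → 2, 2 → 3} should lie close to the theoretical 25%.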

Three examples of the frequent subgraphs discovered by the analysis are
shown below.

{1 → 2, 2 → 3} (support = 25.3%),

{1 → 2, 8 → 12, 12 → 16} (support = 25.4%),

{1 → 5, 5 → 9, 9 → 10, 10 → 11, 11 → 15, 15 → 16} (support = 3.1%).

The first example is a trivial case whose theoretical support value is easily
seen to be 25% because of the two branchings at the nodes 1 and 2. The
second example contains two separated subgraphs, 1 → 2 and 8 → 12 → 16.

Fig. 2. A 4 × 4 path array.



This is because many path ways from the node 2 to the node 8 exist, and
the frequency with which each of the intermediate paths between the nodes 2 and 8
appears in the data is less than the support threshold l-sup = 25% in this case.
The expected probability to go from the node 2 to the node 8 is 50%, and thus
the total expected probability to go through these three paths is 25%, which is
consistent with the support value obtained in the simulation. The last example
contains a full path way from the node 1 to the node 16. The vehicle chooses one of the
binary branching paths at the nodes 1, 5, 9, 10 and 11 with equal probability.
Thus, the expected probability of this path way is $(1/2)^5 = 1/32$, which is
also consistent with its support value.

The following are two examples of the association rules among subgraph
structures.

{1 → 2} ⇒ {1 → 2, 2 → 3} (support = 25.3%, confidence = 50.9%),

{2 → 3} ⇒ {1 → 2, 2 → 3} (support = 25.3%, confidence = 100.0%).

Both of them represent the fact that a vehicle goes through the path way
1 → 2 → 3 with a support value of about 25%. However, the difference in
their confidence reflects the geometrical configuration of the paths 1 → 2 and
2 → 3. When a vehicle goes to the node 2 from the node 1, there are two
choices to go forward at the node 2. In contrast, when a vehicle goes
through the path 2 → 3, it must always have passed the path 1 → 2. More
complex examples reflecting the geometry of the paths are shown below, where
the confidence threshold s-conf is set at 30%.

{2 → 6, 11 → 15} ⇒ {1 → 2, 2 → 6, 6 → 7, 7 → 11, 11 → 15, 15 → 16}
(support = 2.9%, confidence = 50.6%),

{2 → 6, 11 → 15} ⇒ {1 → 2, 2 → 6, 6 → 10, 10 → 11, 11 → 15, 15 → 16}
(support = 2.9%, confidence = 49.4%).

As is easily understood by viewing Figure 2, when a vehicle goes through the paths
2 → 6 and 11 → 15, it necessarily goes through 1 → 2 and 15 → 16, and there
are only the two choices 6 → 7 → 11 and 6 → 10 → 11 to go from the node 6
to the node 11. Accordingly, the confidence of each rule becomes around 50%.
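
The theoretical values quoted in this subsection can be reproduced by enumerating all paths exactly instead of sampling them. The sketch assumes the same layout as the example paths suggest (nodes numbered row by row, links to the right neighbour n + 1 and the lower neighbour n + 4); function names are illustrative.

```python
from fractions import Fraction

SIZE = 4  # assumed 4 x 4 layout, nodes 1..16 numbered row by row


def out_links(node):
    row, col = divmod(node - 1, SIZE)
    targets = []
    if col < SIZE - 1:
        targets.append(node + 1)
    if row < SIZE - 1:
        targets.append(node + SIZE)
    return targets


def enumerate_paths(node=1, prob=Fraction(1), links=()):
    """Yield (set of links, exact probability) for every path to node 16."""
    if node == SIZE * SIZE:
        yield frozenset(links), prob
        return
    targets = out_links(node)
    for nxt in targets:
        # Each outgoing link is chosen with equal probability.
        yield from enumerate_paths(nxt, prob / len(targets), links + ((node, nxt),))


PATHS = list(enumerate_paths())


def support(itemset):
    """Exact probability that a run contains all links of `itemset`."""
    return sum(p for links, p in PATHS if itemset <= links)


def confidence(body, head):
    return support(head) / support(body)


full_path = {(1, 5), (5, 9), (9, 10), (10, 11), (11, 15), (15, 16)}
via_7 = {(1, 2), (2, 6), (6, 7), (7, 11), (11, 15), (15, 16)}

print(len(PATHS))                                   # prints: 20
print(support({(1, 2), (2, 3)}))                    # prints: 1/4
print(support({(1, 2), (8, 12), (12, 16)}))         # prints: 1/4
print(support(full_path))                           # prints: 1/32
print(confidence({(2, 6), (11, 15)}, via_7))        # prints: 1/2
```

The exact values 1/4, 1/32 and 1/2 agree with the simulated supports of about 25% and 3.1% and with the confidences of about 50% reported above.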
Table 1. Computational complexity for support and confidence thresholds.

l-sup  s-conf  Num. of          Max. size of     Num. of  Comp. time [sec]
[%]    [%]     freq. subgraphs  freq. subgraphs  rules    Apriori  Rulegen
35     90      5                2                0        0.07     0.06
       70                                        0                 0.05
       50                                        2                 0.05
       30                                        4                 0.06
25     90      18               3                4        0.13     0.06
       70                                        6                 0.05
       50                                        12                0.05
       30                                        12                0.05
15     90      54               4                12       0.24     0.06
       70                                        17                0.07
       50                                        22                0.07
       30                                        22                0.07
5      90      523              6                68       0.55     0.42
       70                                        86                0.42
       50                                        120               0.44
       30                                        130               0.45
0      90      847              6                134      0.58     0.89
       70                                        152               0.88
       50                                        179               0.89
       30                                        226               0.97



These observations indicate that the Basket Analysis for graph structured
transactions in our proposed framework can properly derive the frequent subgraphs
and the association rules among subgraph structures in a quantitative sense.
Table 1 shows the effect of the support and confidence thresholds on
the computational complexity required. The level of the support threshold has a
significant influence on the number of frequent subgraphs. The computation
time of “Apriori” is the time required to derive all frequent subgraphs, and that
of “Rulegen” is the time to derive all representative association rules from the
frequent subgraphs. The task of deriving all frequent subgraphs faces a
combinatorial explosion of the items for a low support threshold, while “Apriori”
maintains its significant efficiency due to its well organized algorithm. The
computation time required by “Rulegen” also does not show a very drastic increase.
These observations are consistent with the complexity analysis for the
conventional Basket Analysis [1], [8]. The increase of the maximum size of the frequent
subgraphs saturates under the condition of l-sup less than 15%, where the
length of the full path way from the node 1 to the node 16 is 6. This is because
the support of some full path ways such as 1 → 2 → 3 → 4 → 8 → 12 → 16 has
the maximum value of $(1/2)^3 = 1/8$, which is slightly less than 15%.

Fig. 3. A subgraph transaction in a huge URL graph.



4.2 Application to WWW Browsing Histories 

The practical performance of the proposed method has been examined through a
real-scale application. The data analyzed is the log file of the commercial WWW
server of Recruit Co., Ltd. in Japan. The URLs on the WWW form a huge graph,
where URLs are nodes mutually connected by many links. When a client visits
the commercial WWW site, he or she browses only a small part of the huge
graph in an access session, as depicted in Figure 3, and the browsing history of
the session becomes a graph structured transaction. The total number of
URLs involved in this commercial WWW site is more than 100000, and it is one
of the largest sites in Japan. Its total number of hits by nationwide internet
users always ranks within the top three every month in the
Japanese internet record, and the typical size of the log file of the WWW server
for a day is over 400MB.

The basic format of an access record to a URL by a client in the log file is
indicated in Figure 4. As the log file consists of a sequence of access records,
they are initially sorted by the IP addresses, and each subsequence having an
identical IP address, corresponding to the browsing access history in a session
of an individual client, is extracted. Then, those subsequences are transformed
into adjacency matrices, and each graph structured transaction for a session of an
individual client IP address is generated as explained in the earlier section.
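
A possible sketch of this preprocessing is given below. It assumes records of the form "IP timestamp URL" separated by single spaces, timestamps that sort correctly as strings, one session per IP address, and a link between each pair of consecutively accessed URLs. The sample records, the addresses and the function name are invented for illustration, and the sketch builds the link sets directly rather than through an explicit adjacency matrix.

```python
from collections import defaultdict


def transactions_from_log(lines):
    """Group access records by client IP and turn each session into links."""
    by_client = defaultdict(list)
    for line in lines:
        ip, timestamp, url = line.split(" ", 2)
        by_client[ip].append((timestamp, url))

    transactions = {}
    for ip, records in by_client.items():
        records.sort(key=lambda record: record[0])  # order by time stamp
        urls = [url for _, url in records]
        # Consecutive accesses a, b form the directed link a -> b.
        transactions[ip] = frozenset(
            (a, b) for a, b in zip(urls, urls[1:]) if a != b
        )
    return transactions


# Hypothetical log excerpt; real records may carry more fields.
log = [
    "192.0.2.1 1999-01-01T10:00:00 /sports",
    "192.0.2.2 1999-01-01T10:00:05 /travel",
    "192.0.2.1 1999-01-01T10:00:10 /sports/ball",
    "192.0.2.1 1999-01-01T10:00:20 /sports/ball/baseball",
    "192.0.2.2 1999-01-01T10:00:30 /restaurant",
    "192.0.2.1 1999-01-01T10:00:40 /sports",
]

sessions = transactions_from_log(log)
```

Each value of `sessions` is a graph structured transaction that can be numbered and passed to the Basket Analysis in the same way as the earlier examples.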

Table 2 summarizes the statistical results of the analysis of this data by our
approach, varying the support threshold l-sup and the confidence threshold
s-conf. As in Table 1, the number of frequent subgraphs and the computation
time increase when l-sup is decreased.
In contrast, the number of derived
association rules decreases when s-conf is decreased for low l-sup. This



IP address of a client △ Time stamp of the access △ URL address

△: space character

Fig. 4. Basic format of an access record.
Table 2. Statistics of analysis on WWW access transactions.

l-sup  s-conf  Num. of          Num. of  Comp. time [sec]
[%]    [%]     freq. subgraphs  rules    Apriori  Rulegen
0.6    90.0    132              5        151      1
       70.0                     8                 2
       50.0                     18                2
       30.0                     30                2
0.4    90.0    625              251      392      24
       70.0                     186               25
       50.0                     216               24
       30.0                     241               25
0.3    90.0    4,568            2,292    629      486
       70.0                     773               441
       50.0                     107               419
       30.0                     101
tendency is contrary to the case of Table 1. It is attributed
to the feature of the WWW accesses that only a limited number of URL
access patterns are commonly shared among many clients, while the access patterns
of an individual client are slightly different from those common patterns.
In other words, the transaction data involve many common subgraphs and
associations among them in the WWW case. For example, if the following two
association rules involving the common subgraphs A → B on the lhs and E → F
on the rhs are derived under a high confidence threshold s-conf_h,

{A → B, B → C} ⇒ {A → B, B → C, E → F},

{A → B, B → D} ⇒ {A → B, B → D, E → F},

then these two rules are subsumed into the following rule under a lower
confidence threshold s-conf_l, and the above two rules are filtered out by the
aforementioned maximal consequence principle.

{A → B} ⇒ {A → B, E → F}.

On the other hand, the transactions generated in the path array example do
not share many common association patterns among subgraphs, since the
motion of the vehicles along the directed paths is randomly determined. This
feature induces the monotonic increase of the rules as s-conf decreases.

Finally, we show two examples of the association rules among subgraph
structures obtained in this application. Figure 5 depicts a rule representing that more
than 50% of the clients who pass the link from the URL titled “Sports” to
another titled “Ball Game” also pass the link from “Ball Game” to
“Baseball”. Another example, shown in Figure 6, says that nearly 60% of the clients
who go through the link from “Travel” to “Restaurant” also go through the path
“Restaurant” → “Hobby” → “Arts” → “Society and Culture” → “Entertainment”
→ “News” → “Sports”. Though the lhs and the rhs of these example rules
just represent node sequences, association rules among various types of
subgraphs including branching and cyclic structures are also derived. Figure 7 shows
such an example including loops in the pattern. This type of knowledge derived
by the proposed approach can be used to investigate the associations among the
topics of interest of the clients of the WWW site, which provides important insights for
marketing on necessary services.

Fig. 5. Rule example (1) (support = 0.4%, confidence = 52.1%)

Fig. 6. Rule example (2) (support = 0.4%, confidence = 59.7%)



5 Related Work and Discussion 

R. Feldman et al. applied the conventional Basket Analysis to mine association
rules among keyword subsets involved in text files such as document files and
HTML files, and they proposed a method to generate keyword graphs from the
association rules [10]. The graphs are generated by merging the pairwise
associations among keywords involved in the rules. Though this approach can represent
the associations among multiple nodes in the form of graphs, it derives
associations among sets of items and is not intended for applications where the transactions
contain graph structured data. On the other hand, Chen et al. proposed to
derive the longest access sequence patterns among URLs. Their work is close to
our approach. However, the knowledge representation discovered by their framework is
limited to access sequence patterns, whereas our approach can discover graph
structured patterns which cover wider classes of knowledge.

Fig. 7. Rule example (3) (support = 0.3%, confidence = 47.7%)

As shown in the previous section, our method is able to mine frequent
subgraph structures and the associations among them, and it is efficient
enough for some practical, large scale applications. However, one weakness of our
current approach is the requirement that all nodes must be mutually distinct
in the object which produces the transactions. For instance, a common graph
structure such as the memory cell circuits contained in an LSI chip cannot be
discovered from transactions representing fragments of the chip, because all
nodes (devices) must be labeled by mutually different numbers, and memory
cells having an identical structure are then represented by transactions containing
different links. To overcome this limitation, our approach must
be extended to handle the types (colors) or attributes of nodes and links in graphs.

6 Conclusion 

The work reported in this paper proposed an approach to mine frequent graph 
structures and the association rules among them embedded in massive trans- 
action data. The approach consists of a preprocessing stage of the transaction 
data and the Basket Analysis. The validity of the principle we proposed has been
confirmed by applying it to artificially simulated graph-structured data, and its
practicality has also been demonstrated through a large scale real world problem
to mine frequent browsing patterns of URLs and the associations among those 
patterns. 
Abstract. We report the use of genetic algorithms (GAs) as a search
mechanism for the discovery of linear causal models when using two
Bayesian metrics for linear causal models, a Minimum Message Length
(MML) metric [10] and a full posterior analysis (BGe) [3]. We also
consider two structure priors over causal models, one giving all variable
orderings for models with the same arc density equal prior probability (P1)
and one assigning all causal structures with the same arc density equal
priors (P2). Evaluated with Kullback-Leibler distance, prior P2 tended to
produce models closer to the true model than P1 for both metrics, with
MML performing slightly better than BGe. By contrast, when using an
evaluation metric that better reflects the nature of the causal discovery
task, namely a metric that compares the results of predictive
performance on the effect nodes in the discovered model, P1 outperformed P2
in general, with MML and BGe discovering models of similar predictive
performance at various sample sizes. This supports our conjecture that
the P1 prior is more appropriate for causal discovery.



With the rise of probabilistic networks for data analysis and prediction, there 
has been steady progress on automating their discovery [1,4,10,9]. Many pro- 
cesses described for learning such networks give them a causal interpretation, 
raising a question fundamental to the problem of scientific discovery itself — 
how to discover causal relationships from observed data. We focus on a par- 
ticular type of probabilistic network, linear causal models, which are widely 
employed in the social sciences, as in structural equation models (SEMs). Using 
evolutionary search [6] we compare two Bayesian metrics for linear causal mod- 
els, one based on Wallace’s Minimum Message Length (MML) Principle [10] and 
the other by Geiger and Heckerman [3]. We then examine the implications of 
the assumed prior over causal structures on the accuracy and relevance of the 
discovered models. We consider two structure priors, one in which models of a 
given arc density are considered equally likely a priori, and one in which each 
total ordering for models of the same arc density has equal prior probability. 
1 Linear Causal Models

In this paper we limit our attention to models that can be represented as directed 
acyclic graphs (DAGs). Each vertex in the graph is a random variable, and arcs in 
the graph correspond to direct causal connections. For linear causal models, every 
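variable is a weighted linear sum of its parents. We assume the variables $X_i$,
$1 \le i \le K$, have independently normal residuals $r_{in} \sim N(0, \sigma_i^2)$.
$x_{in}$ is then given by:

$$x_{in} = \sum_{X_k \in \Pi_i^S} a_{ki}\, x_{kn} + r_{in} \qquad (1)$$

where $\Pi_i^S$ is the set of parents of $X_i$ in DAG $S$, and $x_{in}$ is the $n$th instance
of $X_i$ in a joint sample $D$ of $X$. The parameter set of $X_i$ is
$\theta_i = (\sigma_i, \{a_{ki} : X_k \in \Pi_i^S\})$.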
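To make the generative reading of a linear causal model concrete, the sketch below draws a joint sample by visiting the variables in a causal order, so that each value is a weighted sum of its parents' values plus an independent normal residual. The function name, argument layout, and example network are hypothetical; only the model form comes from the text above.

```python
import random
from typing import Dict, List, Tuple


def sample_linear_model(order: List[str],
                        parents: Dict[str, List[str]],
                        weights: Dict[Tuple[str, str], float],
                        sigma: Dict[str, float],
                        n_cases: int,
                        seed: int = 0) -> List[Dict[str, float]]:
    """Draw n_cases joint samples; `order` must list parents before children."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_cases):
        case: Dict[str, float] = {}
        for v in order:
            # Weighted linear sum of the already-sampled parent values.
            mean = sum(weights[(p, v)] * case[p] for p in parents[v])
            # Independent zero-mean normal residual with std. dev. sigma[v].
            case[v] = mean + rng.gauss(0.0, sigma[v])
        data.append(case)
    return data


# Hypothetical three-variable model: X1 -> X3 <- X2.
example = sample_linear_model(
    order=["X1", "X2", "X3"],
    parents={"X1": [], "X2": [], "X3": ["X1", "X2"]},
    weights={("X1", "X3"): 0.8, ("X2", "X3"): -0.5},
    sigma={"X1": 1.0, "X2": 1.0, "X3": 0.3},
    n_cases=100,
)
```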

An important property of causal models arises from the conditional indepen- 
dence relations that can be inferred from a model’s structure. Pearl [7] defines 
d-separation, which identifies all implied independencies of a DAG model. If 
we assume that our DAG models of causality satisfy the Markov property - 
i.e., every d-separation in the model implies a conditional independency in the 
population under study - then distinct DAGs can entail identical conditional 
independencies and thus represent identical joint probability distributions. Be- 
cause such models are indistinguishable by examining their implied distributions 
alone (and, hence, by examining observational data alone), they are said to be 
statistically equivalent. In particular, two DAGs are statistically equivalent just 
in case they have identical undirected adjacencies and identical v-structures,
i.e., triples (X, W, Y) where X and Y are not adjacent but are both parents of W.
Such models then differ only in the orientation of one or more arcs. In contrast,
a causal interpretation of these models necessarily distinguishes any two
models that differ even just in the orientation of a single arc, say that between X
and Y. In one model X is a cause of Y, while in the other Y is a cause of X. At
the practical level of employing these two causal models, there can be radically 
different implications upon intervention (e.g., medical intervention). 
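The equivalence criterion just stated (same undirected adjacencies, same v-structures) is easy to check mechanically. The following is a small sketch, assuming each DAG is given as a set of (parent, child) arcs over the same variables with no isolated nodes; the function names are illustrative only.

```python
from itertools import combinations
from typing import FrozenSet, Set, Tuple

Arc = Tuple[str, str]  # (parent, child)


def skeleton(arcs: Set[Arc]) -> Set[FrozenSet[str]]:
    """Undirected adjacencies of the DAG."""
    return {frozenset(a) for a in arcs}


def v_structures(arcs: Set[Arc]) -> Set[Tuple[FrozenSet[str], str]]:
    """Pairs ({X, Y}, W) with X -> W <- Y and X, Y not adjacent."""
    adjacent = skeleton(arcs)
    parents = {}
    for x, w in arcs:
        parents.setdefault(w, set()).add(x)
    found = set()
    for w, ps in parents.items():
        for x, y in combinations(sorted(ps), 2):
            if frozenset((x, y)) not in adjacent:
                found.add((frozenset((x, y)), w))
    return found


def statistically_equivalent(a: Set[Arc], b: Set[Arc]) -> bool:
    return skeleton(a) == skeleton(b) and v_structures(a) == v_structures(b)


chain = {("X", "W"), ("W", "Y")}           # X -> W -> Y
reversed_chain = {("Y", "W"), ("W", "X")}  # Y -> W -> X
collider = {("X", "W"), ("Y", "W")}        # X -> W <- Y
```

The chain and its reversal share a skeleton and have no v-structures, so they are statistically equivalent although they make opposite causal claims; the collider has the same skeleton but is not equivalent to either.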

1.1 The BGe Metric 

Geiger and Heckerman [3] (and with Chickering in [4]) derive a Bayesian metric
for scoring linear causal models. They evaluate the posterior probability of a
hypothesis H given some sample data D. Although it is difficult to normalize
the posterior, for model selection it suffices to calculate the joint probability
of the hypothesis H and the data D, P(H, D) = P(H) f(D | H). For non-causal
linear models Geiger and Heckerman define H to be the hypothesis that the
sample distribution can be represented by the structure S. Under this
definition, if H is true for some structure S, then H is true for all structures that are
statistically equivalent to S. Heckerman and Geiger [3] call this property
hypothesis equivalence. Although they restrict their analysis to individual structures,
Heckerman and Geiger note that according to hypothesis equivalence, statistical
equivalence classes of models should be scored collectively rather than
individual models themselves. However, if a model is considered causal, the hypothesis
must include the assertion that each node in S is a direct cause of each of its
children. This interpretation invalidates hypothesis equivalence, and thus
statistically equivalent causal networks need not be assigned equal scores.
Nevertheless, Heckerman and Geiger surmise that equivalent scores are often appropriate
even when models are causally interpreted [2]. Our interest is exclusively with
causal models. We have argued elsewhere against non-causal interpretations [9];
here we provide a Bayesian metric that distinguishes between equivalent causal
models.

Using hypothesis equivalence Heckerman and Geiger derive the following
metric for scoring linear causal models:

$$\mathrm{BGe}(S, D) = P(S) \prod_{i=1}^{K} \frac{f(D^{\{X_i\} \cup \Pi_i^S} \mid S_c)}{f(D^{\Pi_i^S} \mid S_c)} \qquad (2)$$

where $D^{Y}$ is the data $D$ restricted to the variables in $Y$, $\Pi_i^S$ is the set of
parents of $X_i$ in $S$, and $S_c$ is a fully connected network.
To calculate $f(D^{Y} \mid S_c)$ for an arbitrary structure $S$, they require that the user
supply a prior parameterized model, as well as two equivalent sample sizes
reflecting strength of belief in its accuracy. For this paper we have assumed the least
informative parameter priors possible. We set the prior network to contain no
arcs and use equivalent sample sizes of 1 and K + 2.



1.2 MML Induction 

Minimum Message Length induction [8] applies Bayesian conditionalization in 
information-theoretic form. A two-part message is constructed hypothetically: 
first, the hypothesis itself is encoded, including any parameters, and second, the 
data are encoded given the hypothesis. That hypothesis giving the shortest total 
message is then preferred. Wallace et al. [10] derive the length of the message 
needed to transmit a single variable Xi of a linear causal model (cf. also [6]). 

The MML and BGe metrics differ in two main ways. First, BGe integrates the
parameters out of the posterior, whereas MML considers parameter estimation 
an important part of model discovery. Second, Geiger and Heckerman assume 
that statistically equivalent models satisfy hypothesis equivalence and prove that 
this requires a normal-Wishart prior over the parameters. In contrast, the MML 
metric adopts an independently normal prior for path coefficients and a uniform
prior over the residual variances. Statistically equivalent structures then have 
identical likelihoods only using maximum likelihood parameter estimates. 

1.3 Structure Priors 

To find structure priors for networks, some researchers assume a prior ordering
of the variables; many use a uniform distribution over structures. Geiger and
Heckerman use a prior that penalizes deviations from a user supplied prior
network. We prefer to follow Wallace et al. [10] in suggesting that in the absence
of any knowledge of the causal ordering of the variables, the most sensible prior
over total orderings is uniform, which prior we dub (P1). We consider this to be
the best uninformed structure prior for a causal interpretation of the inferred
models. In an effort to demonstrate this, we contrast (P1) with (P2), which takes
all causal structures with the same arc density to be equally likely.
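One way to see why the two priors can disagree is to count, for structures of equal arc density, how many total orderings of the variables are consistent with each. The sketch below does this by brute force. It rests on an assumed reading of (P1), namely that a structure's prior mass grows with the number of orderings that admit it, whereas (P2) gives every such structure the same mass; the function name and example graphs are hypothetical.

```python
from itertools import permutations
from typing import Iterable, Sequence, Tuple


def consistent_orderings(nodes: Sequence[str],
                         arcs: Iterable[Tuple[str, str]]) -> int:
    """Count total orderings in which every arc runs from earlier to later."""
    arcs = list(arcs)
    count = 0
    for order in permutations(nodes):
        position = {v: i for i, v in enumerate(order)}
        if all(position[a] < position[b] for a, b in arcs):
            count += 1
    return count


nodes = ("A", "B", "C")
chain = [("A", "B"), ("B", "C")]  # A -> B -> C
fork = [("A", "B"), ("A", "C")]   # B <- A -> C
n_chain = consistent_orderings(nodes, chain)
n_fork = consistent_orderings(nodes, fork)
```

Both graphs have two arcs, so (P2) treats them alike, but the fork is compatible with two orderings (A, B, C and A, C, B) and the chain with only one.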



2 Evolutionary Search 

Various search methods have been applied for discovering causal models. For 
this study we employed a non-standard genetic algorithm (cf. [5]), using DAGs 
directly as the genetic representation, with genetic operators designed for DAGs. 
These operators are described in detail in [5] and are extended in [6]. To find
linear causal models using the MML measure, we use the negative message
length as our fitness function. Similarly, we use the log of (2) for the BGe
measure. We will simply refer to the GAs with the respective fitness function and
structure prior combination as MML-P1, MML-P2, BGe-P1 and BGe-P2.



3 Results 

We compared these four methods on seven small datasets (of 2-7 nodes) from 
the social sciences, and on several stochastically generated models.^ 



3.1 Comparison of MML and BGe with Priors P1 and P2

We calculated the Kullback-Leibler (KL) distances between the inferred and 
the true distributions in order to measure the relation between the implied dis- 
tribution of the true and learned models, using maximum likelihood estimates 
for each model’s parameters. Because KL distance only measures the difference 
in implied distributions, it does not distinguish between statistically equivalent 
models, and so it should prefer models found when prior P2 is used. 
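For reference, a linear model with independent normal residuals implies a multivariate normal joint distribution, for which the KL distance has a well-known closed form. The expression below is a sketch under two assumptions not stated in the text: both distributions have zero mean, and the distance is taken from the true distribution P = N(0, Σ_P) to the learned distribution Q = N(0, Σ_Q) over K variables.

```latex
\mathrm{KL}(P \,\|\, Q)
  = \frac{1}{2}\left[
      \operatorname{tr}\!\left(\Sigma_Q^{-1}\Sigma_P\right)
      - K
      + \ln\frac{\det\Sigma_Q}{\det\Sigma_P}
    \right]
```

Because this quantity depends only on the two covariance matrices, any two structures that imply the same covariance receive the same distance, which is why it cannot separate statistically equivalent models.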

Table 1 compares the KL distance averaged over 50 datasets. The reported 
KL distances are multiplied by 100 to reflect the sample size. Boldface results 
indicate the metric with the smallest average KL for each dataset. To measure 
sampling error, the last column of Table 1 gives average KL distances between 
the true model with maximum likelihood parameter estimates for each dataset 
and the true model with its original estimates, and so is an optimal value for 
KL distances. As expected, the P2 prior dominates the results for both metrics, 
with MML-P2 clearly outperforming the other metrics on average. 

If our goal in discovering models is to predict endogenous variables, to inter- 
vene in or explain the relationships between variables, then we are most inter- 
ested in how well the discovered models represent the true causal system. KL 
distance, however, fails to distinguish between statistically equivalent models. 
To overcome this limitation, we evaluated learned models by how well they
predicted values in the set of effect nodes, i.e., nodes with no descendants. Predictive
accuracy in effect nodes was measured by summing their expected negative log 



^ For details of the experimental parameters see [6]. 
Table 1. Average KL distances over 50 datasets of 100 cases for various test
models. Annotations report paired t-test significance at a level of 0.05 (e.g.,
for dataset 15.2 MML-P2 is significantly better than BGe-P1).

Dataset    MML-P1         MML-P2            BGe-P1      BGe-P2         True
15.1       41.95          44.05             43.50       42.19          17.89
15.2       37.71          36.31 B1          39.50       37.30 B1       12.11
15.3       39.93          39.74             43.63       41.74          17.73
15.4       33.23          31.19 M1          31.58       32.56          14.41
15.5       26.69          24.48 B1 B2 M1    27.80       26.37 B1        9.33
12.1       20.97          20.92             21.24       20.28           8.55
12.2       31.11          28.96 B1          33.18       30.27 B1       11.46
12.3       10.41          10.45             11.26        9.83 B1        5.69
12.4       31.99          30.45             32.65       31.34          13.28
12.5       25.63          23.63             25.43       23.69 M1       15.68
10.1       24.77          24.64             25.10       24.52          11.03
10.2       21.59          22.09             22.06       21.61           9.94
10.3       23.52          22.80             23.61       22.66          10.80
10.4        7.01           7.28              6.26        6.83           5.41
10.5       18.39          16.72 B1 M1       19.99       17.72 B1        6.55
Loehlin     9.11 M2       10.88              8.98 M2     9.41 M2        8.06
Rodgers    25.32 B1       22.99 B1 B2 M1    28.57       23.93 M1 B1    10.17
Miller      6.83           6.91              6.76        6.76           6.54
Goldberg   12.95          12.34             13.87 M2    13.17           7.26
Fiji        7.27 B1        6.96 B1           9.05        7.55 B1        4.59
Evans       7.69           7.53              7.28 M1     7.30 M1 M2     6.65
Blau       19.06 B1 B2    17.44 B1 B2 M1    21.53       21.06           8.99

likelihood E[−LL]. This measure discriminates between two structures in which
a causal arc into an effect node is reversed even if the two models are statistically
equivalent. Table 2 gives average E[−LL] results for our experimental models.
The results clearly show that P1 gives better predictive performance over the
effect nodes than P2. This supports our conjecture that for learning causal
structure a structure prior that considers variable orderings equally likely is more
appropriate than one that treats statistically equivalent models as equally likely.



4 Conclusion 



Using genetic algorithms to search for linear causal models, we compared two 
Bayesian metrics (MML and BGe), each with two structure priors (P1 and P2).
On KL-distance the structure prior that considers equivalent structures equally
likely (P2) on average finds models closer to the true model than the prior that
treats variable orderings as equally likely (P1), with MML finding more models of
shorter KL-distances than BGe on average. However, when the correctness of the
discovered causal structure is measured more directly, via predictive performance
over the effect nodes, P1 clearly outperforms P2 for both measures, with BGe and
MML showing similar performance. This supports our conjecture that a structure 
prior treating orderings as equally probable is more suited to causal discovery 
than one that assigns statistically equivalent models equal prior probabilities. 
Table 2. Average expected negative log likelihood of the effect nodes over 50
datasets of 100 cases for various test models.

Dataset    MML-P1             MML-P2          BGe-P1             BGe-P2          True
15.1        9.760              9.794           9.748 M2           9.789           9.541
15.2       15.702             15.661          15.751             15.638 B1       15.070
15.3       11.663 B2 M2       11.825          11.639 M2 B2       11.899          11.418
15.4       11.415             11.423          11.379 M1 M2 B2    11.435          11.242
15.5       13.799             13.724          13.852             13.845          13.148
12.1       12.035             12.023          12.018             11.969          11.718
12.2        9.688              9.660           9.685              9.668           9.399
12.3       13.763 M2          13.797          13.765 M2          13.770 M2       13.702
12.4        6.863 B2 M2        7.145           6.829 M2 B2        7.166           6.708
12.5       10.598 B2 M2       10.847          10.625 M2 B2       10.768          10.390
10.1        5.216              5.235           5.220              5.256           5.077
10.2        9.233              9.249           9.279              9.214           8.683
10.3        5.443 B1 B2 M2     5.622           5.481 M2 B2        5.646           5.330
10.4       11.417 B2 M2       11.652          11.405 M2 B2       11.691          11.289
10.5       11.543             11.498          11.473 M1          11.456 M1       11.200
Loehlin     5.648 B2 M2        6.055           5.625 M2 B2        6.082           5.621
Rodgers     1.917              1.867 B1 M1     1.911              1.863 M1 B1     1.717
Miller      1.523              1.499           1.501              1.526           1.100
Goldberg    1.483 M2           1.554           1.469 M2           1.500           1.357
Fiji        1.996 B2           2.002           2.005              2.009           1.965
Evans       1.678              1.705           1.660              1.695           1.620
Blau        1.820              1.828           1.798              1.812           1.661
Abstract. A graph is one of the most common abstract structures and
is suitable for representing relations between various objects. An analysis
system that directly manipulates graphs is useful for knowledge
discovery. Formal Graph System (FGS) is a kind of logic programming system
which directly deals with graphs just like first order terms. We have
designed and implemented a knowledge discovery system KD-FGS, which
receives graph data and produces a hypothesis by using FGS as a
knowledge representation language. The system consists of an FGS
interpreter and a refutably inductive inference algorithm for FGSs. We report
some experiments of running KD-FGS and confirm that the system is
useful for knowledge discovery from graph data.



1 Introduction 

Machine learning and data mining technology have been used for knowledge dis- 
covery and prediction in many fields [1]. The aim of knowledge discovery is to 
find a small and understandable hypothesis which explains data nicely. A graph 
is one of the most common abstract structures and is suitable for
representing relations between various objects [6]. We believe that an analysis system
that deals directly with graphs is useful for knowledge discovery.

Formal Graph System (FGS, [5]) is a kind of logic programming system 
which directly deals with graphs just like first order terms. So FGS is suitable 
to represent logical knowledge explaining the given graph data. In this paper, 
we propose a knowledge discovery system KD-FGS (see Fig. 1). As inputs, the 
system receives positive and negative examples of graph data. As an output, 
the system produces an FGS program which is consistent with the positive and 
negative examples if such a hypothesis exists. Otherwise, the system refutes 
Fig. 1. KD-FGS: a knowledge discovery system from graph data using FGS

the hypothesis space. KD-FGS consists of an FGS interpreter and a refutably 
inductive inference algorithm of FGS programs. The FGS interpreter is used to 
check whether a hypothesis is consistent with the given graph data or not. 

A refutably inductive inference algorithm, proposed by Mukouchi and
Arikawa [2], is a special type of inductive inference algorithm with refutabil- 
ity of hypothesis spaces. Suppose that a hypothesis space is refutably inferable 
and data are successively given to the algorithm for the hypothesis space. If 
there exists a hypothesis describing the data in the hypothesis space, then the 
algorithm will infer the hypothesis, that is, it will eventually identify the hypoth- 
esis. If not, then the algorithm will refute the hypothesis space, that is, it will 
tell us that no hypothesis in the hypothesis space explains the data and stop. 
When the hypothesis space is refuted, the algorithm chooses another hypothesis 
space and tries to make a discovery in the new hypothesis space. By refuting the 
hypothesis space, the algorithm gives important suggestions to achieve the goal 
of knowledge discovery. Thus, KD-FGS is useful for knowledge discovery from 
graph data. 
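The infer-or-refute behaviour described above can be sketched for the simplified case of a finite hypothesis space and a finite batch of labeled examples. This is only an illustration of the control flow: the algorithm of [2] works incrementally on an infinite presentation, and the representation of hypotheses as Boolean predicates, together with all names below, is a hypothetical stand-in for FGS programs and the FGS interpreter.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Hypothesis = Callable[[object], bool]  # stand-in for "the program proves this atom"
Example = Tuple[object, bool]          # (datum, True for positive / False for negative)


def consistent(h: Hypothesis, examples: Sequence[Example]) -> bool:
    """A hypothesis is consistent if it accepts positives and rejects negatives."""
    return all(h(x) == label for x, label in examples)


def infer_or_refute(space: Sequence[Hypothesis],
                    examples: Sequence[Example]) -> Optional[Hypothesis]:
    """Return a consistent hypothesis, or None to signal that the space is refuted."""
    for h in space:
        if consistent(h, examples):
            return h
    return None


def discover(pool: Sequence[Tuple[str, Sequence[Hypothesis]]],
             examples: Sequence[Example]) -> Optional[Tuple[str, Hypothesis]]:
    """Try each hypothesis space in turn, moving on whenever one is refuted."""
    for name, space in pool:
        h = infer_or_refute(space, examples)
        if h is not None:
            return name, h
    return None


# Hypothetical toy pool over integers instead of graphs.
is_even = lambda n: n % 2 == 0
is_multiple_of_three = lambda n: n % 3 == 0
pool = [("C1", [is_even]), ("C2", [is_even, is_multiple_of_three])]
examples = [(3, True), (6, True), (4, False)]
result = discover(pool, examples)
```

With these examples the first space is refuted, because its only hypothesis rejects the positive example 3, and a consistent hypothesis is found in the second.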

2 FGS as a New Knowledge Representation Language

Formal Graph System (FGS, [5]) is a kind of logic programming system which 
directly deals with graphs just like first order terms. 

Let $\Sigma$ and $\Lambda$ be finite alphabets, and let $X$ be an alphabet, whose element is
called a variable label. Assume that $(\Sigma \cup \Lambda) \cap X = \emptyset$. A term graph $g = (V, E, H)$
consists of a vertex set $V$, an edge set $E$ and a multi-set $H$ whose element is
a list of distinct vertices in $V$ and is called a variable. And a term graph $g$ has
a vertex labeling $\varphi_g : V \to \Sigma$, an edge labeling $\psi_g : E \to \Lambda$ and a variable
labeling $\lambda_g : H \to X$. A term graph is called ground if $H = \emptyset$. For example,
a term graph $g = (V, E, H)$ is shown in Fig. 2, where $V = \{u_1, u_2\}$, $E = \emptyset$,
$H = \{e_1 = (u_1, u_2), e_2 = (u_1, u_2)\}$, $\varphi_g(u_1) = s$, $\varphi_g(u_2) = t$, $\lambda_g(e_1) = x$, and
$\lambda_g(e_2) = y$. An atom is an expression of the form $p(g_1, \ldots, g_n)$, where $p$ is a
predicate symbol with arity $n$ and $g_1, \ldots, g_n$ are term graphs. Let $A, B_1, \ldots, B_m$
be atoms with $m \ge 0$. Then, a graph rewriting rule is a clause of the form
$A \leftarrow B_1, \ldots, B_m$. An FGS program is a finite set of graph rewriting rules. For
example, the FGS program $\Gamma_{SP}$ in Fig. 1 generates the family of all two-terminal
series parallel (TTSP) graphs.

Let $g$ be a term graph and $\sigma$ be a list of distinct vertices in $g$. We call the
form $x := [g, \sigma]$ a binding for a variable label $x \in X$. A substitution $\theta$ is a finite
collection of bindings $\{x_1 := [g_1, \sigma_1], \ldots, x_n := [g_n, \sigma_n]\}$, where the $x_i$ are mutually
distinct variable labels in $X$ and each $g_i$ ($1 \le i \le n$) has no variable labeled with
an element in $\{x_1, \ldots, x_n\}$. For a set or a list $S$, the number of elements in $S$ is
denoted by $|S|$. In the same way as in a logic programming system, we obtain a new
term graph $f$ by applying a substitution $\theta = \{x_1 := [g_1, \sigma_1], \ldots, x_n := [g_n, \sigma_n]\}$
to a term graph $g = (V, E, H)$ in the following way. For each binding
$x_i := [g_i, \sigma_i] \in \theta$ ($1 \le i \le n$) in parallel, we attach $g_i$ to $g$ by removing all the
variables $t_1, \ldots, t_k$ labeled with $x_i$ from $H$, and by identifying the $m$-th element
of $t_j$ and the $m$-th element of $\sigma_i$ for each $1 \le j \le k$ and each $1 \le m \le |t_j| = |\sigma_i|$,
respectively. The resulting term graph $f$ is denoted by $g\theta$. In Fig. 2, for example,
we draw the term graph $g\theta$ which is obtained by applying a substitution
$\theta = \{x := [g_1, (v_1, v_2)], y := [g_2, (w_1, w_2)]\}$ to the term graph $g$. A graph rewriting
rule $C$ is provable from an FGS program $\Gamma$ if $C$ is obtained from $\Gamma$ by finitely
many applications of graph rewriting rules and modus ponens.




Fig. 2. Term graphs $g$ and $g\theta$ obtained by applying a substitution
$\theta = \{x := [g_1, (v_1, v_2)], y := [g_2, (w_1, w_2)]\}$ to $g$.



3 Refutably Inductive Inference of FGS Programs 

In this section, we show that sufficiently large hypothesis spaces of FGS programs 
are refutably inferable and thus give a theoretical foundation of the KD-FGS 
system. We give our framework of refutably inductive inference of FGS programs 
according to [2,4]. Let $g = (V, E, H)$ be a term graph. We denote the size of $g$ by
$|g|$ and define $|g| = |V| + |E| + |H|$. For example, $|g| = |V| + |E| + |H| = 2 + 0 + 2 = 4$
for the term graph $g = (V, E, H)$ in Fig. 2. For an atom $p(g_1, \ldots, g_n)$, we define
$\|p(g_1, \ldots, g_n)\| = |g_1| + \cdots + |g_n|$.
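The size measures just defined can be sketched with a small data structure for term graphs. The class and field names below are hypothetical, and storing edges in a dictionary keyed by vertex pairs is a simplification that rules out parallel edges; the variables are kept in a list because H is a multi-set.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class TermGraph:
    vertices: Dict[str, str]                                   # vertex -> vertex label
    edges: Dict[Tuple[str, str], str] = field(default_factory=dict)  # edge -> edge label
    # Each variable is (list of distinct vertices, variable label).
    variables: List[Tuple[Tuple[str, ...], str]] = field(default_factory=list)

    def is_ground(self) -> bool:
        """A term graph is ground when it has no variables."""
        return not self.variables

    def size(self) -> int:
        """|g| = |V| + |E| + |H|."""
        return len(self.vertices) + len(self.edges) + len(self.variables)


def atom_size(graphs: Iterable[TermGraph]) -> int:
    """||p(g1, ..., gn)|| = |g1| + ... + |gn|."""
    return sum(g.size() for g in graphs)


# The term graph g of Fig. 2: two vertices, no edges, two variables.
g = TermGraph(
    vertices={"u1": "s", "u2": "t"},
    variables=[(("u1", "u2"), "x"), (("u1", "u2"), "y")],
)
g_size = g.size()
```

This reproduces the worked example in the text: two vertices, zero edges, and two variables give a size of 4.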
Definition 1. A graph rewriting rule $A \leftarrow B_1, \ldots, B_m$ is said to be weakly
reducing (resp., size-bounded) if $\|A\theta\| \ge \|B_i\theta\|$ for any $i = 1, \ldots, m$ and any
substitution $\theta$ (resp., $\|A\theta\| \ge \|B_1\theta\| + \cdots + \|B_m\theta\|$ for any substitution $\theta$). An
FGS program $\Gamma$ is weakly reducing (resp., size-bounded) if every graph rewriting
rule in $\Gamma$ is weakly reducing (resp., size-bounded).

For example, the FGS program $\Gamma_{SP}$ in Fig. 1 is weakly reducing but not
size-bounded. The set of all ground atoms (i.e., ground facts) is called the
Herbrand base, denoted by $\mathcal{HB}$, and is considered as the set of all training examples.
A subset $I$ of $\mathcal{HB}$ is called an interpretation, and is considered as a set of
positive training examples. An FGS program $\Gamma$ is called a correct program for an
interpretation $I$ if the least Herbrand model of $\Gamma$, which is the set of all ground
atoms proved from $\Gamma$, is equal to $I$. A complete presentation of an
interpretation $I$ is an infinite sequence $(w_1, t_1), (w_2, t_2), \ldots$ of elements in $\mathcal{HB} \times \{+, -\}$ such
that $\{w_i \mid t_i = +, i \ge 1\} = I$ and $\{w_i \mid t_i = -, i \ge 1\} = \mathcal{HB} \setminus I$. A refutably
inductive inference algorithm is said to converge to an FGS program $\Gamma$ for a
presentation if it produces the same FGS program $\Gamma$ after finitely many
hypothesis changes. We can construct a machine discovery system for
a refutably inferable hypothesis space. Thus the following Theorem 1 gives a
theoretical foundation of KD-FGS.

Definition 2 ([2]). A refutably inductive inference algorithm is said to refutably
infer a hypothesis space $\mathcal{C}$ from complete data, if it satisfies the following
condition: For any interpretation $I \subseteq \mathcal{HB}$ and any complete presentation $\delta$ of $I$, (1)
if there exists a correct program in $\mathcal{C}$ for $I$ then the algorithm converges to a
correct program in $\mathcal{C}$ for $I$ from $\delta$, (2) otherwise the algorithm refutes $\mathcal{C}$ from $\delta$.

Theorem 1 (Based on [2]). For any $n > 0$, the hypothesis space $\mathcal{WR}^{\le n}$
(resp., $\mathcal{SB}^{\le n}$) of all weakly reducing (resp., size-bounded) FGS programs with
at most $n$ graph rewriting rules is refutably inferable from complete data.

4 Implementation and Experimental Results 

We have implemented a prototype of the KD-FGS system by constructing an
FGS interpreter and a refutably inductive inference algorithm in Common Lisp.
The FGS interpreter is an extension of the Prolog interpreter (P. Norvig [3],
Chap. 11). In Table 1, we summarize 6 experiments of running KD-FGS on a
DEC-Alpha compatible workstation (clock 500 MHz) with GCL 2.2. In Exp. 1
and 2, input data are positive and negative examples of TTSP graphs (see Fig. 1).
In Exp. 1 (resp., 2), the hypothesis space $C_1$ (resp., $C_2$) is the set of all restricted
weakly reducing FGS programs with at most 2 (resp., 2) atoms in each body
and at most 2 (resp., 3) rules in each program, which is denoted by “#atom ≤ 2,
#rule ≤ 2” (resp., “#atom ≤ 2, #rule ≤ 3”). After the system receives 3 positive
and 5 negative examples, which is denoted by “#pos=3, #neg=5”, it refutes $C_1$
in Exp. 1 (resp., it converges to a correct FGS program in $C_2$ for TTSP graphs
in Exp. 2). We confirm that the system is useful for knowledge discovery from
graph data.
Table 1. Experimental results on the KD-FGS system

No.   Examples          Hypothesis Space        Received Examples    Result
1     TTSP graph        #atom ≤ 2, #rule ≤ 2    #pos=3, #neg=5       refute
2     TTSP graph        #atom ≤ 2, #rule ≤ 3    #pos=3, #neg=5       infer
3     undirected tree   #atom ≤ 1, #rule ≤ 2    #pos=3, #neg=6       refute
4     undirected tree   #atom ≤ 1, #rule ≤ 3    #pos=3, #neg=6       infer
5     directed tree     #atom ≤ 1, #rule ≤ 2    #pos=3, #neg=10      refute
6     directed tree     #atom ≤ 1, #rule ≤ 3    #pos=4, #neg=10      refute



5 Conclusion 

We have designed and implemented a knowledge discovery system KD-FGS 
which produces an FGS program as logical knowledge for graph data. In or- 
der to achieve practical speedup of KD-FGS, we are implementing another FGS 
interpreter, which is based on a bottom-up theorem proving method, in a parallel 
logic programming language KLIC.
Abstract. In this paper, we propose a new approach that applies the
meta-learning concept to distributed data mining. We name this approach
Knowledge Probing, in which a supervised learning process is organised into
two learning stages. In the first learning phase, a set of base classifiers 
are learned in parallel from a distributed data set. In the second learning 
phase, meta-learning is applied to induce the relationship between an 
attribute vector and the class predictions from all the base classifiers. By 
applying this approach to an environment where base classifiers are pro- 
duced from distributed data sources, the output of Knowledge Probing 
process can be viewed as the assimilated knowledge of that distributed 
learning system. Some initial experimental results on the quality of the 
assimilated knowledge are presented. We believe that an integration of 
Knowledge Probing technique and the available data mining algorithms 
can provide a practical framework for distributed data mining applica- 
tions. 

Keywords: Distributed data mining, Committee learning, Classification
data mining.



1 Introduction 

The vast quantities of commercial and scientific data currently being stored
worldwide are increasingly seen as a source of hidden knowledge. In the
past decade a significant amount of research in the field of data mining has
been done, resulting in a variety of algorithms and techniques for automatically
extracting this hidden information from data. However, there are some important
challenges in applying data mining technologies to real world applications:

— data can be large: the execution time of the learning processes can be pro- 
hibitive when applying the algorithms to volumes of data generated in real 
world applications 

— data can be distributed: data can be physically distributed at remote sites 

Distributed data mining provides a promising solution to these challenges.
The idea is to use data mining algorithms to extract knowledge from several
(normally disjoint) distributed data sets and then use the knowledge from
these individual learned models to create a unified body of knowledge that
represents the whole data well. Thus, distributed data mining can be characterised
as an integration of multiple distributed learned models. Such an integration
can be done by learning from the results of multiple base-learning processes. We can
therefore relate distributed data mining to two important research fields of
machine learning: committee learning, where multiple models are learned for
making accurate predictions by avoiding various kinds of learning bias, and
meta-learning, where models are re-learned and integrated. The following advantages
of distributed data mining stem directly from the combination of these two
learning approaches:

— Learning Accuracy: Using different learning algorithms to learn models from
distributed data sources increases the possibility of achieving higher accuracy, 
especially on a large domain. This is because the integration of 
such models can represent an integration of different learning biases, which 
may compensate for one another's weaknesses. By 
statistical reasoning, the average of independent estimators (i.e. base models 
learned from distributed data sets) has less variance than one individual 
estimator (i.e. a single model learned from the whole data set). If variance is 
defined as a quantity that measures the sensitivity of a model to an unseen 
data item, then learning multiple models can be an effective way to overcome 
the over-fitting problem, that is, the problem of a model being trained to the 
point where it is highly accurate on the training set but not on unseen data 
items. Hansen & Salamon [11] have shown that, for an ensemble of neural 
networks, if all base models have the same probability of making an error, this 
probability is less than 0.5, and all of them make errors independently, then the 
overall error must decrease as a function of the number of models. Recent 
research has also shown that, for learning algorithms such as CART and 
C4.5, which have a high variance [3], a framework of multiple learning from 
partitioned learning spaces would reduce the error caused by the variance of the 
algorithms [2,3]. 

— Execution Time and Memory Limitation: Distributed data mining provides 
a natural solution for large scale data mining, where algorithm complexity 
and memory limitation are always the main obstacles. If a multicomputer 
system is available, then each processor can work on a different partition 
of data in order to derive a model independently. Some minor communication 
overhead is expected to be incurred in this process. A clever model 
combination technique can then benefit from the set of derived models. Hence, a 
distributed architecture such as a distributed memory parallel computer 
system or a workstation cluster will be an ideal platform for distributed data 
mining. This approach is scalable, as an increasing amount of data can be 
compensated for by a linear increase in the number of processors or workstations. 
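
The Hansen & Salamon argument cited under Learning Accuracy can be checked numerically in a simplified setting. The sketch below is only an illustration under stated assumptions: a two-class problem, an odd number of base models, one common error rate, fully independent errors, and simple majority voting. The function name is hypothetical and does not come from the paper.

```python
from math import comb


def majority_vote_error(n_models: int, p_error: float) -> float:
    """Probability that a majority of n independent models err at once.

    Assumes a two-class problem, an odd n_models, and the same error
    probability p_error for every model (illustrative assumptions).
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    majority = n_models // 2 + 1
    # Sum the binomial probabilities of 'majority' or more simultaneous errors.
    return sum(
        comb(n_models, k) * p_error**k * (1.0 - p_error) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 11, 21):
        print(n, round(majority_vote_error(n, 0.3), 4))
```

With an individual error rate below 0.5 the ensemble error shrinks as models are added; above 0.5 the same formula shows the ensemble getting worse, which is why the condition in the text matters.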

So far, much research on distributed data mining has concentrated 
on committee learning [1,4], where the emphasis is put on making accurate 
predictions based on multiple models. Another key technology for distributed data 
mining is knowledge assimilation. That is, after knowledge is induced from 
subsets of data, the key step of the distributed data mining process is to integrate 
the pieces of knowledge into a comprehensive model which represents the whole data 
set. To achieve that goal, meta-learning is required to learn the relationship among 
all the knowledge learned in the base learning phase. Applying meta-learning 
technology to knowledge assimilation is still a very new research field. 

One pioneering work on applying meta-learning to distributed data mining was 
done by Chan and Stolfo [6]. They concentrate on learning from the output of 
concept learning systems. Although Chan’s approach to meta-learning provides 
an interesting and potentially useful solution for distributed learning and data 
mining, it has some fundamental limitations. The first one is the problem of 
knowledge representation. In Chan’s approach, the final classifier produced can 
be regarded as a black box. It does not provide any understanding of the data. It 
can therefore serve the purpose of prediction, but it lacks a descriptive 
function, because the meta-classifier is not an integration of the knowledge from the 
base classifiers but a statistical combination of their predictions. 
Moreover, the algorithm seems to be susceptible to the distribution 
of the initial data sets. It is noted in [7] that a bias introduced by a particular 
distribution formed by a data reduction method has to be taken into consideration. 
It appears that, in Chan’s approach, the quality of the classifier depends on the 
distribution of the initial data set. In addition, the distribution of the unseen 
(test) data set is also likely to have an effect on the accuracy of the classifier. 
Thus, the accuracy of a meta-classifier tends to vary if there is a difference between 
the distribution of the training set and that of the unseen data set. 

In our opinion, one of the most important goals of meta-learning is to 
assimilate knowledge learned by base learning systems. This is particularly important 
for data mining applications, since the descriptive power of the model learned 
by a meta-learning system determines the value of the model. In order to derive 
a descriptive model, we propose Knowledge Probing (KP) as a new 
meta-learning approach for assimilating the knowledge learned by base learning 
systems in a distributed data mining environment. The paper is organised as 
follows: in the following section, we present the basic concept of the knowledge 
probing framework. The design of the experiments is presented in Section 3. The 
experimental results and analysis are in Section 4. The conclusion and further 
work are discussed in the last section. 



2 Knowledge Probing Framework 

Knowledge Probing (KP) was first proposed in [9] as a technique to probe 
descriptive knowledge from a black box model. In classification terms, 
a black box model, for example a neural network, is a model which takes an 
unclassified data set as input and gives a set of class predictions as output. 
The key idea underlying KP is to derive a descriptive model from a black box 
model by learning from the unclassified data set and the corresponding set of 
predictions made by the black box. The basic principle of KP can be presented 
as follows: 

Input: A set S of n unseen data items, 
       a black box model B, 
       a learning algorithm D which provides descriptive output. 

Prediction phase: creating the class values of data items in S 

    $C = B(\{ s_i \mid s_i \in S,\ i = 1, \dots, n \})$ 

Learning phase: learning from the new training set constructed from S 
using the predicted class values C 

    $B^* = D(\{ (s_i, c_i) \mid s_i \in S,\ c_i \in C,\ i = 1, \dots, n \})$ 

Output: Descriptive model $B^*$ 
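
The two phases of the basic procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the descriptive learner here is a toy single-threshold rule learner standing in for a real inducer such as a decision-tree learner, it handles only two classes labelled 0 and 1, and all function names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

Item = Tuple[float, ...]
Example = Tuple[Item, int]


def learn_stump(examples: Sequence[Example]) -> Dict[str, float]:
    """Toy descriptive learner D: the best rule 'attr > threshold => class'.

    Stands in for a real descriptive learner; two classes (0 and 1) only.
    """
    best = None
    for attr in range(len(examples[0][0])):
        for candidate, _ in examples:
            threshold = candidate[attr]
            for low, high in ((0, 1), (1, 0)):
                errors = sum(
                    (high if x[attr] > threshold else low) != c
                    for x, c in examples
                )
                if best is None or errors < best[0]:
                    best = (errors, attr, threshold, low, high)
    _, attr, threshold, low, high = best
    return {"attr": attr, "threshold": threshold, "le": low, "gt": high}


def knowledge_probe(
    probing_set: Sequence[Item],
    black_box: Callable[[Item], int],
    learner: Callable[[Sequence[Example]], Dict[str, float]],
) -> Dict[str, float]:
    """Prediction phase followed by learning phase."""
    predictions = [black_box(s) for s in probing_set]  # C = B(S)
    relabelled = list(zip(probing_set, predictions))   # {(s_i, c_i)}
    return learner(relabelled)                         # B* = D(...)


if __name__ == "__main__":
    S = [(1.0, 5.0), (1.5, 0.0), (2.5, 1.0), (3.0, 9.0)]
    opaque = lambda s: int(3.0 * s[0] > 6.0)  # pretend this cannot be inspected
    print(knowledge_probe(S, opaque, learn_stump))
```

The returned rule is readable (which attribute, which threshold, which class on each side), whereas the black box only ever exposed its predictions.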

This idea can be easily extended to model integration in a distributed learning 
environment. In distributed data mining, a set of distributed learned base models 
is used collaboratively to make predictions based on a prediction scheme such 
as simple voting or the arbiter-combiner model of Chan and Stolfo [5]. We can 
therefore regard such a set of base models together with its prediction scheme 
as a black box. The Knowledge Probing approach can then be employed to derive a 
descriptive model which assimilates the knowledge of all base models. 

Given a set of models (classifiers) $\mathcal{M} = \{M_j \mid j = 1, \dots, k\}$, a prediction 
scheme $\mathcal{P}$ which uses the model set $\mathcal{M}$ to assign class values to unseen data 
items, and a learning algorithm $\mathcal{L}$ which is capable of producing a descriptive 
model, the knowledge probing procedure can be extended as follows: 

Input: A set S of n unseen data items, 
       a set $\mathcal{M}$ of k base models, 
       a prediction scheme $\mathcal{P}$, 
       a learning algorithm $\mathcal{L}$ which provides descriptive output. 

Prediction phase: creating the class values of data items in S using the 
prediction scheme $\mathcal{P}$ 

    for i = 1 to n { 
        for j = 1 to k { 
            $c_{ij} = M_j(s_i)$, where $M_j \in \mathcal{M}$ and $s_i \in S$ 
        } 
        $C_i = \mathcal{P}(c_{i1}, \dots, c_{ik})$ 
    } 

Learning phase: learning from the new training set constructed 
from S together with the combined prediction values 

    $B^* = \mathcal{L}(\{ (s_i, C_i) \mid s_i \in S,\ i = 1, \dots, n \})$ 

Output: Descriptive model $B^*$ 
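
The prediction phase of the extended procedure amounts to the nested loop above. The following sketch assumes simple majority voting as the prediction scheme (one of the schemes the text mentions) and uses hypothetical names throughout; its output, a relabelled probing set, is what would then be handed to any descriptive learner.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Item = Tuple[Hashable, ...]
Model = Callable[[Item], Hashable]


def majority_vote(votes: Sequence[Hashable]) -> Hashable:
    """A simple prediction scheme P: the most frequent class wins."""
    return Counter(votes).most_common(1)[0][0]


def relabel_probing_set(
    probing_set: Sequence[Item],
    models: Sequence[Model],
    scheme: Callable[[Sequence[Hashable]], Hashable] = majority_vote,
) -> List[Tuple[Item, Hashable]]:
    """Prediction phase: C_i = P(M_1(s_i), ..., M_k(s_i)) for every s_i."""
    relabelled = []
    for s in probing_set:                      # for i = 1 to n
        votes = [m(s) for m in models]         # c_ij = M_j(s_i), j = 1..k
        relabelled.append((s, scheme(votes)))  # C_i = P(c_i1, ..., c_ik)
    return relabelled


if __name__ == "__main__":
    base_models = [
        lambda s: "yes" if s[0] > 2 else "no",
        lambda s: "yes" if s[1] > 5 else "no",
        lambda s: "yes" if s[0] + s[1] > 6 else "no",
    ]
    probing = [(1, 1), (3, 1), (1, 9), (3, 9)]
    for pair in relabel_probing_set(probing, base_models):
        print(pair)
```

Because the scheme is passed in as a function, the same loop works unchanged for other combination rules.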
The key step of the KP approach is to use an independent data set^1 to probe 
knowledge from all base models. Since the final model is learned from a data 
set whose class values are assigned by a prediction scheme, which integrates 
the predictions made by all base models, it can be regarded as an 
approximation of the integration of the base models. Therefore, if the model $B^*$ 
learned by the learning algorithm $\mathcal{L}$ has a descriptive representation, 
then the model in effect assimilates the knowledge of all base models in $\mathcal{M}$. 

Implementing a high quality KP framework is a challenging research problem, since 
the rich choices in implementing each component result in a high dimensional 
design space. Empirical studies are therefore crucial to understanding the issues in 
constructing a high quality KP framework for distributed data mining. In the 
next section, we present the design of the experiments performed to study the 
behaviour of the KP framework. 

3 Experimental Design 

To investigate the behaviour of the KP framework in comparison with the 
traditional learning approach^2, two corresponding sets of experiments are designed 
to be executed in parallel. 

The details of the design are as follows. 



3.1 Data Preparation 

The experiments are done on fifteen data sets from the UCI Repository 
of Machine Learning Databases [14]. We simulate a distributed environment by 
partitioning the training sets. Part of each data set is used as a probing set. 
In order to provide a reasonable amount of data for the base model learning 
phase and the probing phase, we chose data sets which are relatively large 
(at least 1,000 instances). Another concern in data preparation is the 
size of the evaluation or test set. To make a reliable comparison between the two 
approaches, the result should have a small deviation. This requires the test set 
to be relatively large. Bauer and Kohavi [2] observe that in some data sets 
a learning algorithm can generate a highly accurate model by using only a small 
part of the training set. For example, in the Mushroom data set, training with 
two thirds of the data usually results in 0% error. For this purpose, we have used the 
Learning Curve feature of MLC++ [12] to generate learning curves for different 
training and test set sizes on the chosen data sets. In selecting the training set size 
of each data set, we chose the size where the error is relatively low and there 
is a reasonable amount of data left to use as a test set. In this experiment, at 
least 60% of the data set is used as a test set. In addition, we chose the 
point where the standard deviation of the estimate was not large, in order to 
avoid the variability of learning from small training sets. 

Details of all data sets and the percentage of data used for training are 
described in Table 1. 

^1 How to prepare this independent data set is expected to be a crucial further 
research issue for KP. Meanwhile, to avoid additional bias, the probing data set used 
in this study is independent of all data used to learn any base model. That is, 
all data items in the probing set are separate from the data used in the first learning 
phase. 

^2 In this paper, the traditional learning approach means a learning process 
which takes a complete data set as input and generates a single model as 
output. 



                       Attributes      Missing                                      Training 
Data Set     Size      Discrete Cont.  value (%)  # Class  Data Density             Entropy  size (%) 

Real World Data Set 
Abalone       4,177     1        7     0          29       6.12 × 10^[illegible]    1.00     20 
Adult        48,842     8        6     0.95        2       4.01 × 10^[illegible]    0.79     30 
German        1,000    13        7     0           2       1.02 × 10^[illegible]    0.88     40 
Hypothyroid   3,163    18        7     6.74        2       1.29 × 10^[illegible]    0.28     40 
Chess         3,196    36        0     0           2       3.10 × 10^-08            1.00     20 
Letter       20,000     0       16     0          26       1.08 × 10^[illegible]    4.70     20 
Mushroom      8,124    22        0     1.39        2       4.96 × 10^[illegible]    1.00     20 
Nursery      12,960     8        0     0           5       1.00                     1.72     30 
Satellite     6,435     0       36     0           7       1.58 × 10^-64            -        20 
Segment       2,310     0       19     0           7       1.57 × 10^-44            2.81     30 
Shuttle      58,000     0        9     0           7       4.03 × 10^-14            0.96     20 
Sick          3,162    18        7     6.74        2       8.60 × 10^-18            0.45     40 
Thyroid       3,772    22        7     5.54        5       Inf                      -        40 

Artificial Data Set 
LED7          2,000     7        0     0          10       15.60                    3.32     30 
Waveform      5,000     0       40     0           3       1.55 × 10^[illegible]    1.58     20 

Table 1. Details of all data sets used in the experiments. 



3.2 Overall Design 

To ensure the fairness of the comparison, each experiment is separated into 
three phases: a data preparation phase, a distributed committee learning phase and 
a Knowledge Probing phase. The data preparation is done once for each data set. 
The same training and evaluation sets are then used in the later phases for 
both the sequential and the distributed approaches, which are executed in parallel. In 
each experiment, given a data set, a learning algorithm L, a prediction 
scheme P and some parameters (e.g. the number of partitions), the procedure is 
as follows. 

Data Preparation Phase 

1. Following our previous studies [15,10], we randomly divide the data set 
into two parts, T and PS, in the proportion of 90% and 10% respectively. 
PS is used as the probing set in the second phase of the framework. T is kept 
to be further divided in the next step. 

2. T is randomly divided into D and E. D is used as the data set from which 
we further sample training sets. Following the suggestion of Kohavi and 
Wolpert [13], D is made twice the size of the desired training set, which 
is chosen from points on the learning curve. E is used as the evaluation set. 
The size of E is designed to be at least 40% of T. 

3. Ten sample sets (training sets) x are generated from D by uniform 
random sampling without replacement. 

First Phase: Distributed Committee Learning 

1. For each sample set x from the previous phase, we perform a mutually exclusive 
partitioning of the data into n roughly equal subsets. (Here, we used n 
equal to four.) 

2. With a given learning algorithm L, we learn a model from each subset and 
call it a base model. 

3. A set of n base models together with the prediction scheme P are then 
evaluated on the evaluation set E. 
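
The data preparation steps and the four-way partition used in the first phase can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the desired training size is passed in as an absolute count, each of the ten samples is read as being drawn without replacement within itself (so different samples may overlap), and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Row = TypeVar("Row")


def prepare_data(
    data: Sequence[Row],
    train_size: int,
    rng: random.Random,
    n_samples: int = 10,
) -> Tuple[List[Row], List[Row], List[List[Row]]]:
    """Return (PS, E, samples) following the data preparation steps."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_probe = len(shuffled) // 10                    # PS takes 10% of the data
    probing, rest = shuffled[:n_probe], shuffled[n_probe:]  # rest plays T
    pool = rest[: 2 * train_size]                    # D: twice the training size
    evaluation = rest[2 * train_size:]               # E: what remains of T
    if len(evaluation) < 0.4 * len(rest):
        raise ValueError("E would fall below 40% of T")
    # Each sample is drawn without replacement; samples may overlap each other.
    samples = [rng.sample(pool, train_size) for _ in range(n_samples)]
    return probing, evaluation, samples


def partition(sample: Sequence[Row], n_parts: int = 4) -> List[List[Row]]:
    """Mutually exclusive split into n_parts roughly equal subsets."""
    return [list(sample[i::n_parts]) for i in range(n_parts)]


if __name__ == "__main__":
    ps, ev, samples = prepare_data(range(1000), 200, random.Random(0))
    print(len(ps), len(ev), len(samples), [len(p) for p in partition(samples[0])])
```

Passing an explicit `random.Random` instance keeps the split reproducible, which matters when the same training and evaluation sets must be reused by both approaches.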

In parallel, the sequential learning is done as follows: 

1. The learning algorithm L is used to learn a single model $M_s$ from a sample 
set x. 

2. $M_s$ is then evaluated on the evaluation set E. 

Second Phase: Knowledge Probing 

1. The set of n base models from the first phase is used to generate n sets of 
predictions on the probing set PS. 

2. The prediction scheme P is then used to combine those n sets of predictions 
into a set of combined predictions $C_{d2}$. 

3. The probing set PS, with new class values from $C_{d2}$, is then learned by L, 
which outputs a final model $F_d$. 

4. Model $F_d$ is then evaluated on the evaluation set E. 

In parallel, the sequential learning is done as follows: 

1. Model $M_s$ is used to give a set of predictions $C_{s2}$ on the probing set PS. 

2. The probing set PS, with the newly assigned class values from $C_{s2}$, is then 
learned by L, which outputs the final model $F_s$. 

3. $F_s$ is evaluated on the evaluation set E. 

In this study, we use C4.5 as the learning algorithm (L) and probabilistic 
prediction^3 as the prediction scheme (P), and we set the number of partitions to 
four. 

^3 In probabilistic prediction, each base model gives its prediction as a set of 
probabilities, one for each class. The probabilities from all base models are then 
summed per class. The final prediction is the class with the highest probability. 
4 Experimental Results and Analysis 

The results of the experiments on fifteen data sets for the Distributed Committee 
Learning phase (first phase) and the Knowledge Probing phase (second phase) 
are listed in Table 2. The results of each experiment (on one data set) were 
averaged over ten sample trials. 



              Accuracy (%)                                        Tree size 
              First phase              Second phase 
Data Set      Sequential  Distributed  Sequential  Distributed   Sequential  Distributed 

Abalone       20.7672     22.4479      20.5410     22.2749       146.10      123.30 
Adult         84.1456     84.5841      83.9809     84.3332       196.30      158.50 
German        66.0167     68.3565      65.7103     66.3788        18.10       13.50 
Hypothyroid   98.4622     97.6889      97.6274     97.3374         5.20        4.80 
Chess         94.3221     94.0498      94.3105     94.0614        18.20       11.60 
LED7          70.0000     70.4729      68.2337     68.4979        28.20       26.20 
Letter        67.1896     63.8235      64.2143     59.0777       414.20      483.40 
Mushroom      99.3959     98.5480      99.1224     98.5183        17.70       11.20 
Nursery       91.8028     89.8842      90.3773     89.4555        87.90       52.00 
Satellite     79.4363     80.3299      78.1628     77.7704        47.80       47.80 
Segment       90.8063     87.3406      87.9061     84.5728        23.40       21.60 
Shuttle       99.6539     99.5176      99.5755     99.4840        19.60        8.60 
Sick          97.0650     96.7838      96.9684     96.6608         9.60        9.70 
Thyroid       98.5856     97.5691      98.1547     97.0718         7.00        6.20 
Waveform      69.2108     73.3160      68.5698     68.6439        74.20       84.20 


Table 2. The average accuracy and tree size (over 10 trials) of both learning 
phases. 



Comparison between Phases. The average accuracy in Table 2 shows that 
the probed models of both the distributed and the sequential 
approaches generally have lower accuracy than the models from the first phase, 
which is what we expected: the framework comprises two phases of 
learning, and in each learning phase an inductive bias has to 
be introduced to cut down the search space, which can introduce additional 
error into the final result. The difference in accuracy between the two phases can be 
viewed as the additional error introduced by KP; averaged over all data sets, it 
is 0.89% in the sequential case and 1.37% in the distributed 
case. 

Comparison between Sequential and Distributed Approaches. In the first phase, 
the distributed approach improves accuracy on six data sets, by 
0.44 to 4.11%; in the second phase it improves accuracy on five data sets, by 
0.07 to 1.73%. Overall, the accuracy of the distributed 
approach remains comparable to that of the sequential one. The difference in 
accuracy between the two approaches in the first phase can be considered 
the error introduced by distributed committee learning. In this study, the average 
additional error is 0.14%. As for the size of the probed tree, the results 
in Table 2 show that eleven of the fifteen data sets exhibit a decrease 
in tree size. 

Therefore, in general, the accuracy of the models from the distributed 
approach in both phases is comparable to that of the sequential 
approach, while the probed trees of the distributed approach are relatively 
smaller. 
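
The summary figures quoted in this section can be recomputed from Table 2. The short script below transcribes the table (the column order is first-phase sequential and distributed accuracy, second-phase sequential and distributed accuracy, then sequential and distributed tree size) and derives the averages and counts; the helper name is hypothetical.

```python
# Rows copied from Table 2. Each tuple holds accuracy (%) for the first phase
# (sequential, distributed) and the second phase (sequential, distributed),
# followed by the tree size (sequential, distributed).
TABLE_2 = {
    "Abalone":     (20.7672, 22.4479, 20.5410, 22.2749, 146.10, 123.30),
    "Adult":       (84.1456, 84.5841, 83.9809, 84.3332, 196.30, 158.50),
    "German":      (66.0167, 68.3565, 65.7103, 66.3788, 18.10, 13.50),
    "Hypothyroid": (98.4622, 97.6889, 97.6274, 97.3374, 5.20, 4.80),
    "Chess":       (94.3221, 94.0498, 94.3105, 94.0614, 18.20, 11.60),
    "LED7":        (70.0000, 70.4729, 68.2337, 68.4979, 28.20, 26.20),
    "Letter":      (67.1896, 63.8235, 64.2143, 59.0777, 414.20, 483.40),
    "Mushroom":    (99.3959, 98.5480, 99.1224, 98.5183, 17.70, 11.20),
    "Nursery":     (91.8028, 89.8842, 90.3773, 89.4555, 87.90, 52.00),
    "Satellite":   (79.4363, 80.3299, 78.1628, 77.7704, 47.80, 47.80),
    "Segment":     (90.8063, 87.3406, 87.9061, 84.5728, 23.40, 21.60),
    "Shuttle":     (99.6539, 99.5176, 99.5755, 99.4840, 19.60, 8.60),
    "Sick":        (97.0650, 96.7838, 96.9684, 96.6608, 9.60, 9.70),
    "Thyroid":     (98.5856, 97.5691, 98.1547, 97.0718, 7.00, 6.20),
    "Waveform":    (69.2108, 73.3160, 68.5698, 68.6439, 74.20, 84.20),
}


def summarise(table):
    """Average error differences and win counts derived from the table."""
    rows = list(table.values())
    n = len(rows)
    return {
        "kp_error_sequential": sum(r[0] - r[2] for r in rows) / n,
        "kp_error_distributed": sum(r[1] - r[3] for r in rows) / n,
        "committee_error": sum(r[0] - r[1] for r in rows) / n,
        "first_phase_wins": sum(r[1] > r[0] for r in rows),
        "second_phase_wins": sum(r[3] > r[2] for r in rows),
        "smaller_trees": sum(r[5] < r[4] for r in rows),
    }


if __name__ == "__main__":
    for name, value in summarise(TABLE_2).items():
        print(f"{name}: {value:.2f}")
```

Run on the transcribed rows, the averages round to 0.89, 1.37 and 0.14, and the counts come out as six, five and eleven, matching the figures in the text.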

5 Conclusion and Further Work 

In this paper, we have presented our ongoing research on Knowledge Probing as 
a new approach to distributed data mining. Preliminary experiments have 
shown that the framework can be an effective approach to producing a model of 
comparable quality to that of the traditional approach by assimilating knowledge from 
distributed learned models. We are now investigating various issues in building 
a distributed data mining framework based on the knowledge probing 
approach. In particular, we are currently studying the impact of the properties of a 
probing set on the quality of the final model. Studying a theoretical 
performance model of the framework is also a very interesting research topic. Some 
interesting research on applying Bayesian learning theory to estimating the performance 
of distributed learning, such as the work of Yamanishi [16], has influenced 
our research on this subject. 

The KP framework provides a general method for knowledge assimilation in 
various data mining scenarios, including extracting knowledge from neural 
networks, integrating different data mining tasks (e.g. combining a regression 
procedure with classification) and incremental learning. A range of applications of the 
knowledge probing methodology will be investigated in our future work. 

Abstract. Empirical equations are an important class of regularities 
that can be discovered in databases. In this paper we concentrate on 
the role of equations as definitions of attribute values. Such definitions 
can be used in many ways that we briefly describe. We present a discovery 
mechanism that specializes in finding equations that can be used as 
definitions. We introduce the notion of shared operational semantics. It 
consists of an equation-based system of partial definitions and it is used 
as a tool for knowledge exchange between independently built databases. 
This semantics augments the earlier developed semantics for rules used 
as attribute definitions. To put the shared operational semantics on a 
firm theoretical foundation we developed a formal interpretation which 
justifies empirical equations in their definitional role. 



1 Shared semantics for distributed autonomous DB 

In many fields, such as medicine, manufacturing, banking, the military and 
education, similar databases are kept at many sites. Each database stores 
information about local events and uses attributes suitable for a local task, but since 
the local situations are similar, the majority of attributes are compatible among 
databases. Yet an attribute may be missing in one database while it occurs in 
many others. For instance, different military units may apply the same battery 
of personality tests, but some tests may not be used in one unit or another. 
Similar irregularities are common with medical data: different tests may be applied 
in different hospitals. 

Missing attributes lead to problems. A recruiter new at a given unit may 
query a local database to find candidates who match a desired description, 
only to realize that one component $a_i$ of that description is missing in the local 
database, so that the query cannot be answered. The same query would work in other 
databases, but the recruiter is interested in identifying suitable candidates in the 
local one. 

In this paper we introduce operational semantics that provides definitions of 
missing attributes. Definitions are discovered by an automated process. They are 
used for knowledge exchange between databases and jointly form an integrated 
semantics of our Distributed Autonomous Knowledge System. 

N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 453-463, 1999. 

© Springer-Verlag Berlin Heidelberg 1999 

The task of integrating established database systems is complicated not only 
by the differences between the sets of attributes but also by differences in the 
structure and semantics of data, for instance between the relational, hierarchical 
and network data models. We call such systems heterogeneous. The notion of 
an intermediate model, proposed by Wiederhold, is very useful in dealing with 
the heterogeneity problem, because it describes the database content at a 
relatively high abstract level, sufficient to guarantee a homogeneous representation of 
all databases. In this paper we propose a discovery layer as an intermediate 
model for networked databases. Our discovery layer contains rules and equations 
extracted from a database. 

To eliminate the heterogeneity problem, D. Maluf and G. Wiederhold [5] 
proposed an ontology algebra which provides the capability for interrogating 
many knowledge resources, which are largely semantically disjoint, but where 
articulations have been established that enable knowledge interoperability. The 
main difference between the two approaches is that they do not use an intermediate 
model for communication, and they do not consider automated discovery 
systems as knowledge sources. 

Navathe and Donahoo [6] proposed that the database designers develop a 
metadata description (an intermediate model) of their database schema. A collec- 
tion of metadata descriptions can then be automatically processed by a schema 
builder to create a partially integrated global schema of a heterogeneous dis- 
tributed database. In contrast, our intermediate model (a discovery layer) is 
built without any help from database designers. Its content is created through 
the automated knowledge extraction from databases. 

1.1 Methods that can construct operational definition 

Many computational mechanisms can be used to define values of an attribute. 
Ras et al. [4], [11] (1989-1990) introduced a mechanism which first seeks rules 
of the form “If Boolean-expression(x) then a(x)=w” and then applies them as 
partial definitions of attribute a. Recently, Prodromidis & Stolfo [8] 
mentioned attribute definitions as a useful task. In this paper we expand 
attribute definitions from rules to equations. We call them operational definitions 
because each is a mechanism by which the values of a defined attribute can be 
computed. Many are partial definitions, as they apply only to subsets of records that 
match the “if” part of a definition. 

1.2 Shared semantics in action: query answering 

Many real-world situations fit the following generic scenario. A query q that 
uses attribute a is “unreachable” at database $S_1$ because a is missing in $S_1$. 

A request for a definition of a is issued to other sites in the distributed 
autonomous database system. The request specifies attributes $a_1, \dots, a_n$ available 
at $S_1$. When attribute a and a subset $\{a_{i_1}, \dots, a_{i_k}\}$ of $\{a_1, \dots, a_n\}$ are available in 
another database $S_2$, a discovery mechanism is invoked to search for knowledge 
at $S_2$. A computational mechanism can be discovered by which values of a can 




Discovery of Equations and the Shared Operational Semantics 455 



be computed from values of some of $a_{i_1}, \dots, a_{i_k}$. If discovered, such a mechanism 
is returned to site $S_1$ and used to compute the unknown values of a that occur 
in query q. 

The same mechanism can apply if attribute a is available at site $S_1$, but some 
values of a are missing. In that case, the discovery mechanism can be applied at 
$S_1$, if the number of the available values of a is sufficiently large. 



2 Other applications 

Functional dependencies in the form of equations are a succinct, convenient 
form of knowledge. They can be used in making predictions, explanations and 
inferences. An equation $a = r_m(a_1, a_2, \dots, a_m)$ can be directly used to predict the 
value $a(x)$ of a for object x by substituting the values of $a_1(x), a_2(x), \dots, a_m(x)$, 
if all are available. If some are not directly available, they may be predicted by other 
equations. 
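
The use of equations for prediction, including the case where a missing argument is itself predicted by another equation, can be sketched as a small recursive lookup. Everything here is illustrative: the attribute names and the two equations in the example are hypothetical and are not taken from any database discussed in the paper.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Equation = Tuple[Sequence[str], Callable[..., float]]


def predict(
    attr: str,
    record: Dict[str, float],
    equations: Dict[str, Equation],
    _pending: frozenset = frozenset(),
) -> Optional[float]:
    """Return record[attr] if stored, else try to compute it from an equation.

    Arguments that are themselves missing are predicted recursively; None is
    returned when no chain of equations reaches the attribute.
    """
    if attr in record:
        return record[attr]
    if attr not in equations or attr in _pending:  # unknown, or a cycle
        return None
    arg_names, func = equations[attr]
    args = [
        predict(name, record, equations, _pending | {attr})
        for name in arg_names
    ]
    if any(value is None for value in args):
        return None
    return func(*args)


if __name__ == "__main__":
    equations = {
        "bmi": (("weight_kg", "height_m"), lambda w, h: w / (h * h)),
        "height_m": (("height_cm",), lambda c: c / 100.0),
    }
    print(predict("bmi", {"weight_kg": 81.0, "height_cm": 180.0}, equations))
```

The `_pending` set guards against equations that define each other circularly, a case the text does not discuss but which any chained lookup has to handle.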

When we suspect that some values of a may be wrong, an equation imported 
from another database may be used to verify them. An equation acquired at the 
same database may be used, too, if the discovery mechanism is able to distinguish 
the wrong values as the outliers. For instance, patterns discovered in clean data 
can be applied to discovery of wrong values in the raw data. 

Equations that are used to compute missing values are empirical 
generalizations. Although they may be reliable, we cannot trust them unconditionally, 
and it is good practice to seek their further verification, especially if they are 
applied to an expanded range of values of a. The verification may come from 
additional knowledge that can be used as alternative definitions. Ras [9], [10] 
(1997-1998) used rules coming from various sites and verified their consistency. 
His system can use many strategies which find rules describing decision attributes 
in terms of classification attributes. It has been used in conjunction with 
systems such as LERS (developed by J. Grzymala-Busse) and AQ15 (developed by 
R. Michalski). 

Equations that are generated at different sites can be used, too, to cross- 
check the consistency of knowledge and data coming from different databases. 
If the values of a computed by two independent equations are approximately 
equal, each of the equations receives further confirmation as a computational 
method for a. 

All equations by which values of a can be computed expand the understand- 
ing of a. Attribute understanding is often initially inadequate when we receive a 
new dataset for the purpose of data mining. We may know the domain of values 
of a, but we do not understand a’s detailed meaning, so that we cannot apply 
background knowledge and we cannot interpret the knowledge discovered about 
a. In such cases, an equation that links a poorly understood attribute a with 
attributes $a_1, \dots, a_n$, the meaning of which is known, explains the meaning of a 
in terms of $a_1, \dots, a_n$. 

3 Request (Quest) for a definition 

For the purpose of inducing equations from data we could adapt various 
discovery systems [7], [2]. We have chosen the 49er system (Zytkow and Zembowicz, 
1993) because it applies to data available in databases and because it searches 
for equations that apply to subsets of data, in addition to equations that apply 
to all data. The system allows one attribute to be described as a function of other 
attributes, and it seeks equations when attributes are numerical. It has been 
applied successfully to many databases from various domains. 

Special requirements must be met by an equation that is to be used as a 
definition of a given numerical attribute. One of the main problems with the search for 
equations is that a best fit can always be found for any dataset in any class of 
models (equations). But is the best fit good enough? How good is good enough? 
Equations often provide rough estimates of patterns, but those estimates may 
not be good enough for definitions. How good must the fit of an equation be so that 
the equation can be used as a definition? 

When we know the desired accuracy of fit, we know how to evaluate equations 
against data. In database applications there is a “natural” limit on the accuracy 
for those common attributes whose values are numerical and discrete. Consider 
an attribute whose values are integers, such as weight in pounds or age in years. 
The error (accuracy) of fit can be derived from the granularity of the domain. 
For any three adjacent values $v_1, v_2, v_3$ in ascending order, the acceptable 
accuracy of determination of $v_2$ is $(v_3 - v_1)/4$. For instance, for age in years, 
the accuracy is half a year. That accuracy is entirely satisfactory, but sometimes 
even a worse fit is still acceptable for a definition. 
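
The granularity rule can be written as a short function. This sketch takes the rule as (v3 - v1)/4 over each value's sorted neighbours, which is consistent with the half-year example for ages; the function name is hypothetical, and the lowest and highest domain values receive no entry because the text does not say how they are treated.

```python
from typing import Dict, Sequence


def acceptable_accuracy(domain: Sequence[float]) -> Dict[float, float]:
    """Map each interior value v2 to (v3 - v1) / 4 over its sorted neighbours."""
    values = sorted(set(domain))
    return {
        v2: (v3 - v1) / 4.0
        for v1, v2, v3 in zip(values, values[1:], values[2:])
    }


if __name__ == "__main__":
    print(acceptable_accuracy(range(18, 23)))   # ages in whole years
    print(acceptable_accuracy([1, 2, 4, 8]))    # unevenly spaced domain
```

For an evenly spaced integer domain every interior value gets the same tolerance, while an uneven domain yields a tolerance that widens where the values are sparse.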

Consider the situation in which the required accuracy of fit $\epsilon_i$ is provided for all 
data $(x_i, y_i, \epsilon_i)$, $i = 1, \dots, n$. For each candidate equation $y = f(x)$, the probability $Q$ 
can be estimated that the data $(x_i, y_i, \epsilon_i)$, $i = 1, \dots, n$, could have been generated by 
$y_i = f(x_i) + r_i$, where $r_i$ is drawn from the normal distribution $N(0, \epsilon_i)$. A demanding probability 
threshold such as Q > 0.01 is also needed. 
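
The text does not spell out how this probability is computed. One standard way to estimate it, assumed here purely for illustration, is a chi-square goodness-of-fit test, where $\epsilon_i$ is the required accuracy of the i-th data point and $p$ (a symbol introduced here) is the number of fitted parameters of $f$:

```latex
\chi^2 = \sum_{i=1}^{n} \left( \frac{y_i - f(x_i)}{\epsilon_i} \right)^2,
\qquad
Q = \Pr\left( \chi^2_{n-p} \ge \chi^2 \right)
```

Under this reading, $Q$ is the probability that a chi-square variable with $n - p$ degrees of freedom is at least as large as the observed statistic, and an equation is accepted when $Q$ exceeds the chosen threshold.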

In summary, the quest for a definition in the form of an equation includes: 

• the attribute a for which a definition is sought in the form of an equation; 

• the accuracy of attribute a for each value in the domain $V_a$ of a; 

• a set of attributes $\{a_1, \dots, a_n\}$ which can be used in the definition. 

The resultant equations, if any, have the form $a(x) = f(a_{i_1}, \dots, a_{i_k})$, and they 
fit the data within a demanding probability threshold Q = 0.01, which is the 
default value for definitions. 

3.1 Functionality Test 

Plenty of time can be saved if equations are not sought in data which do not 
satisfy the mathematical definition of functional relationship. 

Definition: Given a set $D$ of value pairs $(v_i, w_i)$, $i = 1, \dots, N$, of two attributes 
a and b, and the range $V_a$ of a: b is a function of a in $D$ iff for each $v_0$ in $V_a$ 
there is exactly one value $w_0$ of b such that $(v_0, w_0)$ is in $D$. 

The following algorithm approximates this definition. It determines whether 
it is worthwhile to search for an equation that fits the data. 

Algorithm: Test approximate functional relationship between a and b 

given the contingency table of actual record counts and the set $V_a$ of values of a 
AV ← average number of records per cell 
for each value in $V_a$ 
    find all groups of cells with adjacent values of b and counts > AV 
    if # of groups > α then return NO-FUNC 
if average # of groups > β then return NO-FUNC else return FUNCTION 

This algorithm is controlled by two modifiable parameters, α and β, which 
measure local (α) and global (β) uniqueness of b; that is, the number of values of 
b for the same value of a. The default values used by 49er are α = 2 for data from 
databases, and β ≈ 1.5. For α = 3 the functionality test fails when for a value in 
V_a there are more than 3 adjacent groups of cells with above-average density of 
points. This higher value 3 of α solves the problem of rare outliers, allowing up 
to 2 outliers if they happen rarely. However, many outliers or frequent multiple 
values in y should fail the test; therefore the value of β is much smaller and close 
to 1. Note that setting both parameters to 1 corresponds to the strict mathematical 
definition of functionality given above. The presence of error, noise, and other data 
imperfections forces the values of α and β to be larger than 1. Noise handling 
by varying the number of cells in the table is treated in detail by Zytkow & 
Zembowicz (1993). 
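
The functionality test can be sketched in a few lines. This is an illustrative reading of the algorithm, not the 49er implementation: the representation of the contingency table as a dictionary from each value of a to a list of record counts ordered by the value of b, and the function name `functionality_test`, are assumptions.

```python
def functionality_test(table, alpha=2, beta=1.5):
    """Approximate test of whether b is a function of a.

    table: dict mapping each value of a to a list of record counts,
           one count per value of b, in ascending order of b (assumed layout).
    """
    cells = [count for row in table.values() for count in row]
    average = sum(cells) / len(cells)          # AV: average records per cell
    groups_per_value = []
    for row in table.values():
        groups, inside = 0, False
        for count in row:
            dense = count > average
            if dense and not inside:           # a new run of adjacent dense cells starts
                groups += 1
            inside = dense
        if groups > alpha:                     # local test, controlled by alpha
            return "NO-FUNC"
        groups_per_value.append(groups)
    mean_groups = sum(groups_per_value) / len(groups_per_value)
    return "NO-FUNC" if mean_groups > beta else "FUNCTION"

diagonal = {1: [10, 0, 0], 2: [0, 10, 0], 3: [0, 0, 10]}
print(functionality_test(diagonal))
```

A table whose rows each contain two separate dense groups passes the local test for α = 2 but fails the global one, because the average number of groups (2) exceeds β = 1.5.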

The same mechanism applies when we want to determine a functional relation 
in a set of data tuples D of size 1 + k, for k ≥ 2. If the test is successful, 
equations of the form $b(x) = f(b_1, \ldots, b_k)$ are sought. If the test fails, it will be 
applied to subsets of data when they are generated by 49er. 

3.2 Equation Finder’s search 

The task of equation finding can be formally defined by the input of n datapoints 
which come from the projection of attributes a and b from data table S, and the 
computation of the required accuracy of b: $(v_i, w_i, \varepsilon_i)$, $i = 1, \ldots, n$. The output is 
the list of acceptable equations. Since the equations are initially 2-d and can be 
subsequently refined, the acceptance threshold is at this stage less demanding 
(Q > 0.0001). 

Equation Finder’s search can be decomposed into (1) generation of new 
terms, (2) selection of pairs of terms, (3) generation and evaluation of equa- 
tions for each pair of terms. The combination of these three searches can be 
summarized by the following algorithm: 

Algorithm: Find Equation 

T ← (A B)        ; the initial list of terms for search #1 
old-T ← NIL      ; the list of terms already used 
E ← a set of polynomial equation models    ; list of models for search #3 
loop until new terms in T exceed threshold of complexity 
    2T ← list of new pairs of terms created from T and old-T 
                 ; the list generated by search #2, initially (A B) 
    for each pair in 2T and for each model in E 
        find and evaluate the best equation 
    if at least one equation accepted, then 
        return all accepted equations and HALT the search 
    old-T ← old-T augmented with T 
    T ← list of new terms created from old-T 

For each pair of terms (either original attributes a and b or terms x and y) 
generated by search #2 and for each polynomial up to the maximum pre-specified 
degree, search #3 proposes polynomial models $y = f(x, a_0, \ldots, a_q)$, which are 
then solved for b, if possible, and compared with the models considered earlier. 
For each equation which comes out as a new one, the best values are found for 
the parameters (coefficients) $a_0, \ldots, a_q$, and error values $\delta a_0, \ldots, \delta a_q$ for each 
parameter. Each polynomial coefficient for which $|a_i| < \delta a_i$ is removed. The 
equation is accepted as a definition of b by a if the significance measure exceeds 
a threshold, set to 0.01. If that threshold is not met, a refinement process (not 
treated in this paper) applies to the equation if the significance measure exceeds 
a threshold, set by default at 0.0001. The significance is based on the $\chi^2$ test and 
the number of degrees of freedom, that is, on the number of data points minus 
the number of parameters in the equation. 
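
A significance measure of this kind can be sketched as follows. The sketch assumes that the measure is the standard χ² goodness-of-fit probability Q, computed from the residuals scaled by the required accuracy and from the degrees of freedom; the function names `chi2_survival` and `fit_significance` are hypothetical and do not come from Equation Finder. The survival function is computed exactly with the recurrence Q(ν + 2) = Q(ν) + (χ²/2)^(ν/2) e^(−χ²/2) / Γ(ν/2 + 1).

```python
import math

def chi2_survival(chi2, dof):
    """Probability that a chi-square variable with `dof` degrees of freedom
    exceeds `chi2`."""
    if dof < 1:
        raise ValueError("need at least one degree of freedom")
    if chi2 <= 0.0:
        return 1.0
    z = chi2 / 2.0
    if dof % 2 == 1:
        s, q = 0.5, math.erfc(math.sqrt(z))   # base case: 1 degree of freedom
    else:
        s, q = 1.0, math.exp(-z)              # base case: 2 degrees of freedom
    while s < dof / 2.0:                      # step up two degrees at a time
        q += math.exp(s * math.log(z) - z - math.lgamma(s + 1.0))
        s += 1.0
    return q

def fit_significance(points, f, n_params):
    """points: (x, y, eps) triples; f: candidate equation y = f(x)."""
    chi2 = sum(((y - f(x)) / eps) ** 2 for x, y, eps in points)
    return chi2_survival(chi2, len(points) - n_params)

data = [(x, 2 * x + 1, 0.5) for x in range(5)]
q_good = fit_significance(data, lambda x: 2 * x + 1, n_params=2)  # perfect fit
q_bad = fit_significance(data, lambda x: 0.0, n_params=2)         # hopeless fit
```

Under these assumptions an equation would be accepted as a definition when Q exceeds 0.01 and passed to refinement when Q exceeds 0.0001.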

Correlation analysis is often used as a measure of linearity of a relation. Our 
approach offers a far broader search for equations. Many textbook examples 
show that correlation values are close to zero (that means, no correlation) even 
though a sharp functional dependency occurs in the data. Our Equation Finder 
returns well-fitted equations in many such cases. 
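
A textbook case of this effect is easy to reproduce. In the sketch below (the helper name `pearson` is ours), y = x² on a range of x symmetric around zero is an exact functional dependency, yet the linear correlation coefficient vanishes.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]          # exact dependency y = x^2
r = pearson(xs, ys)               # no linear correlation despite the exact law
```

A search over polynomial models of degree 2 would recover the dependency that the correlation coefficient misses.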



3.3 Efficiency 

The functionality test operates on contingency tables. Since the size of the table 
is typically small compared to the size of the data, and the test requires one pass 
through the table, it is extremely efficient. It also saves a large amount of time 
because it prevents a far more costly equation finding search when that search cannot be 
successful. Generation of a contingency table is linear in the number of records. 
The number of contingency tables is linear in the number of attributes consid- 
ered. If the number of original attributes is very large, various techniques of 
feature selection can be used to reduce their number. For a comprehensive treat- 
ment of feature selection, see [3]. Sampling, in turn, can reduce the number of 
records. Equation finding is linear in the number of records and is proportional 
to the number of models considered. The space of Equation Finder search can 
be limited in different ways by setting the parameter values for each search, such 
as depth of search and the list of operators. Potentially the most costly is the search 
in subsets of data, but it too can be adjusted to the available resources by 
limiting the depth of search. 
4 A shared semantics of equations in a Distributed 
Autonomous Knowledge System 

In this section each database in Distributed Autonomous Database Systems will 
be extended to a knowledge system. We first recall the notions of an information 
system and a distributed information system. Next we define the shared meaning 
of attributes in a Distributed Autonomous Knowledge System DAKS. 

By an information system we mean a structure $S = (X, A, V)$, where X 
is a finite set of objects, A is a finite set of attributes, and $V = \bigcup\{V_a : a \in A\}$ 
is a set of their values. We assume that: 

• $V_a$, $V_b$ are disjoint for any $a, b \in A$ such that $a \neq b$, 

• $a : X \to V_a$ is a function for every $a \in A$. 

Instead of a, we will often write $a_{[S]}$ to denote that a is an attribute in S. 

By a distributed information system [9] we mean a pair $DS = (\{S_i\}_{i \in I}, L)$ 
where: 

• I is a set of sites, 

• $S_i = (X_i, A_i, V_i)$ is an information system for any $i \in I$, 

• L is a symmetric, binary relation on the set I. 

In this paper we assume a distributed information system $DS = (\{S_i\}_{i \in I}, L)$ 
which is consistent, that is, 

$(\forall i)(\forall j)(\forall x \in X_i \cap X_j)(\forall a \in A_i \cap A_j)\ (a_{[S_i]}(x) = a_{[S_j]}(x))$. 

In the remainder of this paper we also assume that $S_j = (X_j, A_j, V_j)$ 
and $V_j = \bigcup\{V_{ja} : a \in A_j\}$, for any $j \in I$. 

We will use A to name the set of all attributes in DS, $A = \bigcup\{A_j : j \in I\}$. 

4.1 Shared operational semantics 

The shared semantics (see [12]) is defined for the set A of all attributes in all 
information systems in DS. For each attribute a in A, the operational meaning 
of a is defined by: 

1. the set of (pointers to) information systems in which a is available: 
$\{S_i : a \in A_i\}$; 

2. the set of information systems in which a definition of a has been derived, 
jointly with the set of definitions in each information system. Definitions can 
be equations, boolean forms, etc. 

3. the set of information systems in which a definition of a can be used, 
because the defining attributes are available there. An attribute a is a defined 
attribute in an information system S if: 

(a) a definition DEF of a has been discovered in an $S_i$ in DS; 

(b) all other attributes in the definition DEF are present in S; in such cases 
they can be put together in a JOIN table and DEF can be directly 
applied. 
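
The three components of the operational meaning can be sketched as a small lookup. The data layout is an assumption made for illustration: `sites` maps a site name to its set of attributes, and each definition is a hypothetical triple (site where it was derived, defined attribute, defining attributes); the function name `operational_meaning` is likewise ours.

```python
def operational_meaning(attr, sites, definitions):
    """Return (available_at, derived_at, usable_at) for attribute `attr`.

    sites:       dict site -> set of attribute names (assumed layout)
    definitions: list of (site, defined_attr, defining_attrs) triples
    """
    # 1. sites in which the attribute itself is available
    available_at = {s for s, attrs in sites.items() if attr in attrs}
    # 2. sites in which a definition was derived, with the definitions found there
    derived_at = {}
    for site, defined, defining in definitions:
        if defined == attr:
            derived_at.setdefault(site, []).append(tuple(defining))
    # 3. sites in which some definition can be applied, because all
    #    defining attributes are present there
    usable_at = {
        s
        for s, attrs in sites.items()
        for found in derived_at.values()
        for defining in found
        if set(defining) <= attrs
    }
    return available_at, derived_at, usable_at

sites = {"S1": {"a", "b", "c"}, "S2": {"b", "c"}, "S3": {"c", "d"}}
definitions = [("S1", "a", ("b", "c"))]       # a = f(b, c), discovered at S1
meaning = operational_meaning("a", sites, definitions)
```

In this example a is stored only at S1, but it becomes a defined attribute at S2, where b and c are both present.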
4.2 Equations as partial definitions: the syntax 

We will now define the syntax of definitions in the form of equations. Partial 
definitions are included, as they are often useful. In the next subsection we give 
an interpretation of partial definitions. 

Functors are the building blocks from which equations and inequalities can 
be formed. Those in turn are the building blocks for partial definitions. Assume 
that x is a variable over $X_i$ and $r_1, r_2, \ldots, r_k$ are functors. Also, we assume here 
that $m_j$ is the number of arguments of the functor $r_j$, $j = 1, 2, \ldots, k$. The number 
of arguments can be zero. A zero-argument functor is treated as a constant. 

By a set of s(i)-atomic-terms we mean the least set $T0_i$ such that: 

• $0, 1 \in T0_i$, 

for any symbolic attribute $a \in A_i$, 

• $[a(x) = w] \in T0_i$ for any $a \in A_i$ and $w \in V_{ia}$, 

• $\sim[a(x) = w] \in T0_i$ for any $a \in A_i$ and $w \in V_{ia}$, 

for any numerical attributes $a, a_1, a_2, \ldots, a_{m_j}$ in $A_i$, 

• $[a \ \rho\ r_j(a_1, a_2, \ldots, a_{m_j})](x) \in T0_i$, where $\rho \in \{=, <, >\}$. 

s(i)-atomic-terms of the form $[a(x) = w]$ and $[a = r_j(a_1, a_2, \ldots, a_{m_j})](x)$ are 
called equations. 

By a set of s(i)-partial-definitions (s(i)-p-defs in short) we mean the least 
set $T_i$ such that: 

• if $t(x) \in T0_i$ is an equation, then $t(x) \in T_i$, 

• if $t(x)$ is a conjunction of s(i)-atomic-terms and $s(x)$ is an equation, then 
$[t(x) \to s(x)] \in T_i$, 

• if $t_1(x), t_2(x) \in T_i$, then $(t_1(x) \vee t_2(x)), (t_1(x) \wedge t_2(x)) \in T_i$. 

For simplicity we often write t instead of t(x). 

The set s(I)-p-defs represents all possible candidate definitions built from 
attributes that can come from different information systems in DS. s(I)-p-defs 
is defined in a similar way to s(i)-p-defs: the set $V_i$ is replaced by $\bigcup\{V_j : j \in I\}$ 
and the set $A_i$ is replaced by $\bigcup\{A_j : j \in I\}$. 



4.3 Equations as partial definitions: the interpretation 

By a standard interpretation of s(i)-p-defs in $S_i = (X_i, A_i, V_i)$ of a distributed 
information system DS we mean a function $M_i$ such that: 

• $M_i(0) = \emptyset$, $M_i(1) = X_i$, 

• $M_i(a(x) = w) = \{x \in X_i : a_{[S_i]}(x) = w\}$, 

• $M_i(\sim(a(x) = w)) = \{x \in X_i : a_{[S_i]}(x) \neq w\}$, 

• for any $\rho \in \{=, <, >\}$, 

$M_i((a \ \rho\ r_j(a_1, a_2, \ldots, a_{m_j}))(x)) = \{x \in X_i : a_{[S_i]}(x) \ \rho\ r_j(a_{1[S_i]}(x), a_{2[S_i]}(x), \ldots, a_{m_j[S_i]}(x))\}$, 

• $M_i([t \to s]) = \{x \in X_i : \text{if } x \in M_i(t) \text{ then } x \in M_i(s)\}$, 

• if $t_1, t_2$ are s(i)-p-defs, then 

$M_i(t_1 = t_2)$ = (if $M_i(t_1) = M_i(t_2)$ then True else False). 

Let us assume that $[t_1 \to (a_1(x) = w_1)]$, $[t_2 \to (a_2(x) = w_2)]$ are s(i)-p-defs. 
We say that they are $S_i$-consistent if either $a_1 \neq a_2$ or $M_i(t_1 \wedge t_2) = \emptyset$ or 
$w_1 = w_2$. Otherwise, these two s(i)-p-defs are called $S_i$-inconsistent. 

Similar definitions apply when $w_1$ and $w_2$ in those partial definitions are 
replaced by $r_1(a_1, a_2, \ldots, a_{m_1})(x)$ and $r_2(a_1, a_2, \ldots, a_{m_2})(x)$. 
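
The standard interpretation and the consistency condition can be sketched over a toy information system. The layout is assumed: objects are a dictionary from object identifiers to attribute records, a partial definition $[t \to (a(x) = w)]$ is passed as the hypothetical triple $(M_i(t), a, w)$, and $M_i(t_1 \wedge t_2)$ is taken to be the intersection $M_i(t_1) \cap M_i(t_2)$, which the text does not state explicitly.

```python
def m_eq(objects, attr, w):
    """M_i(a(x) = w): objects whose attribute `attr` equals w."""
    return {x for x, rec in objects.items() if rec[attr] == w}

def m_neg_eq(objects, attr, w):
    """M_i(~(a(x) = w)): objects whose attribute `attr` differs from w."""
    return {x for x, rec in objects.items() if rec[attr] != w}

def m_implies(objects, m_t, m_s):
    """M_i([t -> s]): objects that satisfy s whenever they satisfy t."""
    return {x for x in objects if x not in m_t or x in m_s}

def consistent(pdef1, pdef2):
    """S_i-consistency of [t1 -> (a1 = w1)] and [t2 -> (a2 = w2)],
    each given as (M_i(t), attribute, value)."""
    (m_t1, a1, w1), (m_t2, a2, w2) = pdef1, pdef2
    return a1 != a2 or not (m_t1 & m_t2) or w1 == w2

objects = {
    1: {"b": "x", "a": "lo"},
    2: {"b": "x", "a": "lo"},
    3: {"b": "y", "a": "hi"},
}
rule_lo = (m_eq(objects, "b", "x"), "a", "lo")   # [b(x) = x -> a(x) = lo]
rule_hi = (m_eq(objects, "b", "y"), "a", "hi")   # [b(x) = y -> a(x) = hi]
clash = ({2, 3}, "a", "hi")                      # overlaps rule_lo on object 2
```

Here `rule_lo` and `rule_hi` are consistent because their conditions cover disjoint sets of objects, while `clash` assigns a different value to an object already covered by `rule_lo`.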



5 Discovery layer 

In this section, we introduce the notions of a discovery layer and a distributed 
autonomous knowledge system. Also, we introduce the concept of a dynamic 
operational semantics to reflect the dynamics of constantly changing discovery 
layers. 

Notice that while in the previous sections s(i)-p-defs have been interpreted 
at the sites at which all relevant attributes are present, we now consider 
s(I)-defs imported from site k to site i. 

By a discovery layer $D_{ki}$ we mean any s(i)-consistent set of s(k)-p-defs, of 
the two types specified below, which are satisfied, by means of the interpretation 
$M_k$, by most of the objects in $S_k$: 

• $[t \to (a = r_m(a_1, a_2, \ldots, a_m))(x)]$, where $a_1, a_2, \ldots, a_m \in A_i$ and $a \in A_k$, 
and t is a conjunction of atomic terms that contain attributes that occur 
both in $A_i$ and in $A_k$, 

• $[t \to (a(x) = w)]$, where $a \in A_k$ and t satisfies the same conditions as 
above. 

Suppose that a number of partial definitions have been imported to site i 
from a set of sites $K_i$. All those definitions can be used at site i. 

Thus, the discovery layer for site $i \in I$ is defined as a subset of the set 
$D = \bigcup\{D_{ki} : k \in K_i\}$. 

By Distributed Autonomous Knowledge System (DAKS) we mean $DS = 
(\{(S_i, D_i)\}_{i \in I}, L)$ where $(\{S_i\}_{i \in I}, L)$ is a distributed information system and $D_i$ 
is a discovery layer for a site $i \in I$. 

Figure 1 shows the basic architecture of DAKS (WWW interface and a query 
answering system kdQAS that can request and use s(I)-p-defs are also added to 
each site of DAKS). 

Predicate logic and i-operational semantics are used to represent knowledge 
in DAKS. Many other representations are, of course, possible. We have chosen 
predicate logic because of the need to manipulate s(I)-defs syntactically 
without changing their meaning. This syntactical manipulation of s(I)-defs will 
be handled by kdQAS. By designing an axiomatic system which is sound we are 
certain that the transformation process for s(I)-p-defs based on these axioms 
either will not change their meaning or will change it in a controlled way. It will 
produce s(i)-p-defs approximating the initial s(I)-p-defs. 

Fig. 1. Distributed Autonomous Knowledge System 

Clearly, if for each non-local attribute we collect rules and equations from 
many sites of DAKS and then resolve all inconsistencies among them, the re- 
sulting rules and equations in the local discovery layer have more chance to be 
locally true. 

Let $M_i$ be a standard interpretation of s(i)-p-defs in $S_i$ and $C_i = \bigcup\{V_k : 
k \in I\} - V_i$. By i-operational semantics of s(I)-p-defs in $DS = (\{(S_i, D_i)\}_{i \in I}, L)$ 
where $S_i = (X_i, A_i, V_i)$ and $V_i = \bigcup\{V_{ia} : a \in A_i\}$, we mean the interpretation 
$N_i$ such that: 

• $N_i(0) = \emptyset$, $N_i(1) = X_i$, 

• for any $w \in V_{ia}$, 

$N_i(a(x) = w) = M_i(a(x) = w)$, $N_i(\sim(a(x) = w)) = M_i(\sim(a(x) = w))$, 

• for any $w \in C_i \cap V_{ka}$ where $k \neq i$, 

$N_i(a(x) = w) = \{x \in X_i : ([t \to [a(x) = w]] \in D_i \wedge x \in M_i(t))\}$, 

$N_i(\sim(a(x) = w)) = \{x \in X_i : (\exists v \in V_a)[(v \neq w) \wedge ([t \to [a(x) = v]] \in D_i) \wedge (x \in M_i(t))]\}$, 

• for any $w \in C_i \cap V_{ka}$ where $k \neq i$ and a is a numeric attribute, 

$N_i(a(x) = w) = \bigcup\{x \in X_i : (\exists y \in M_k[a(y) = w = r_n(a_1, a_2, \ldots, a_m)])\ [M_i([a_{1[S_i]}(x) = a_{1[S_k]}(y)] \wedge [a_{2[S_i]}(x) = a_{2[S_k]}(y)] \wedge \ldots \wedge [a_{m[S_i]}(x) = a_{m[S_k]}(y)]) \wedge [a(y) = w = r_n(a_1, a_2, \ldots, a_m)] \in D_i]\}$, 

$N_i(\sim(a(x) = w)) = X_i - N_i(a(x) = w)$, 

• for any s(I)-terms $t_1, t_2$, 

$N_i(t_1 \vee t_2) = N_i(t_1) \cup N_i(t_2)$, $N_i(\sim(t_1 \vee t_2)) = N_i(\sim t_1) \cap N_i(\sim t_2)$, 

$N_i(t_1 \wedge t_2) = N_i(t_1) \cap N_i(t_2)$, $N_i(\sim(t_1 \wedge t_2)) = N_i(\sim t_1) \cup N_i(\sim t_2)$, 

$N_i(\sim\sim t) = N_i(t)$, 

• for any s(I)-terms $t_1, t_2$, 

$N_i(t_1 = t_2)$ = (if $N_i(t_1) = N_i(t_2)$ then True else False). 

The i-operational semantics $N_i$ represents a pessimistic approach to the 
evaluation of s(I)-p-defs because of the way the non-local s(I)-atomic-terms are 
interpreted (their lower approximation is taken). 
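
The pessimistic character of $N_i$ for a non-local symbolic attribute can be seen in a small sketch. The layout is assumed for illustration: the discovery layer is a list of hypothetical rules `((cond_attr, cond_value), target_attr, value)`, where the condition t is simplified to a single local atomic term rather than a conjunction.

```python
def local_eq(objects, attr, w):
    """M_i(attr(x) = w) over the local objects of site i."""
    return {x for x, rec in objects.items() if rec.get(attr) == w}

def n_eq(objects, layer, attr, w):
    """N_i(a(x) = w) for a non-local attribute a: objects covered by some
    rule [t -> a(x) = w] of the discovery layer."""
    result = set()
    for (cond_attr, cond_value), target, value in layer:
        if target == attr and value == w:
            result |= local_eq(objects, cond_attr, cond_value)
    return result

def n_neg_eq(objects, layer, attr, w):
    """N_i(~(a(x) = w)): objects covered by a rule [t -> a(x) = v] with v != w."""
    result = set()
    for (cond_attr, cond_value), target, value in layer:
        if target == attr and value != w:
            result |= local_eq(objects, cond_attr, cond_value)
    return result

objects = {1: {"b": "x"}, 2: {"b": "y"}, 3: {"b": "z"}}   # attribute a is not local
layer = [(("b", "x"), "a", "lo"), (("b", "y"), "a", "hi")]
positive = n_eq(objects, layer, "a", "lo")
negative = n_neg_eq(objects, layer, "a", "lo")
```

Object 3 is covered by no imported rule, so it belongs neither to the positive nor to the negative set: only objects whose value of a can be derived are counted, which is the lower approximation mentioned above.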
Abstract. Spams are the unwanted messages that we receive 
by electronic mail and that promise us glory and fortune 
or bombard us with political slogans or with violent or pornographic content. 
This article shows how data mining techniques, such as 
supervised learning methods based on induction graphs, can be used to analyse these 
Spams so that they can be eliminated from our electronic mail. 



1 Introduction 

Our electronic mailboxes now contain many unwanted messages, the Spams, which range 
from advertising to pornographic pictures and all kinds of propaganda. 

USENET is heavily used by Spammers¹ to reach thousands of subscribers 
quickly. Internet providers estimate that 30% of received messages 
are illicit. The scale of the problem becomes clearer when we recall that 
between 1995 and 1998, the volume of data transmitted every day via USENET [3] 
grew from 586 MB to more than 5 GB. That does not include messages 
sent directly to our electronic mailboxes. That is why we must try to reduce the volume 
of these messages. 



2 Anti-Spams Techniques 

Numerous computer companies are trying to develop tools to fight Spams. The software 
proposed so far is relatively simple. Most anti-Spam software uses 
only the message headers for filtering. Once Spammers were able to 
bypass the rules of this software, programmers decided to analyse 
the content of messages as well. Techniques that fight Spams using rules combining 
keywords are not always adapted to the users' reality. Other anti-Spam 
techniques have been tested [5], [1]. 

One of the approaches that appeared promising to us is the one proposed in [4], which 
introduces the concept of auto-training. This paper follows that perspective. 

¹ By this name, we designate the authors of Spams. 
3 Data-Mining and Learning 

The limits of the classical approaches mentioned above therefore prompt us to 
look for an intelligent system capable of adjusting itself to the Spammers' inventions and 
ingenuity. One solution available to us is to use Data-Mining techniques, and more 
specifically supervised learning techniques. 

In this research, we set ourselves two hierarchical targets: 

- The first is to create software able to learn automatically how to recognize 
Spams. 

- The second is to make this training incremental, so that the 
software can adjust to the new techniques of the Spammers. 

An archive of spam messages is available on the Internet at ftp.spam-archive.org. 
The idea is to download these undesirable messages and to add to them messages 
that we could qualify as legitimate. The latter could be all our stored messages. In 
the end, we obtain a training and test base for learning algorithms. 



4 Learning Data 

We downloaded 10 000 English Spams from the FTP site ftp.spam-archive.org. 
The training techniques that we consider work on attribute-value data tables, 
i.e. two-dimensional tables in which rows represent messages and columns 
represent the attributes of each message. 

The messages are in their natural state, i.e. variable-length text whose 
format does not differ from that of the other electronic messages that we receive 
every day. 

The first question is therefore how to transform these texts into a 
vector of "attribute=value" descriptors. We use an approach based on n-grams, 
which we now describe. 

The trigram technique uses the frequency of three-letter sequences in a large 
sample of a given language. The idea is to capture the intuition that, 
for example, a word ending in "-ed" is more likely to be an 
English word, just as a word ending in "-ez" is more likely to be 
French [6]. A trigram is a sequence of three characters in a text. For example, 
the word "dollar" includes the trigrams "dol", "oll", "lla" and "lar". Trigrams 
are used in natural language research for grammatical analysis [2], for 
spell checkers, for automatic character recognition systems, etc. 
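
Trigram extraction itself is a one-line sliding window. The sketch below is illustrative only (the function name `trigrams` is ours, and it works on raw characters without any normalisation, which is an assumption); it reproduces the "dollar" example.

```python
def trigrams(text):
    """Return every sequence of three consecutive characters in `text`."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(trigrams("dollar"))
```

A text of n characters yields n - 2 trigrams, so texts shorter than three characters yield none.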

If we code every available message in the example base according to the 
frequency of each trigram, and if we want a universal program, we risk 
generating vectors with up to 300 trillion attributes. In theory we 
could work on such descriptions but, in practice, few machines would 
support such a volume of data. In fact, some trigrams never occur, whereas others 
are too frequent in all types of messages to be informative. The idea 
is therefore to study how to select informative trigrams for the training. 

The least frequent trigrams are not the most useful either, because they are 
too specific. To determine this subset of trigrams, we compiled the 
trigrams of 200 messages judged to be Spams that we took from the FTP site 
ftp.spam-archive.org. Then, we ordered these trigrams according to their number of 
occurrences in the 200 messages. 

On the basis of the 200 analyzed Spams, we found 15889 different trigrams, 
some of which were observed more than 18000 times. 

We arbitrarily selected 225 trigrams² with a frequency of occurrence that 
seemed reasonable to us: present at least 15 times. We are currently studying other 
selection strategies. 

The training file $\Omega_a$ is composed of 400 messages, of which 200 are Spams 
and 200 are non-spams³. For each message $\omega$, we calculated the 
frequency of each of the 225 selected trigrams and added an attribute C indicating the 
class of the message. If the message $\omega$ is a Spam, we have $C(\omega) = 1$; 
otherwise, $C(\omega) = 2$. 

All other available messages were subsequently used as the validation 
sample. 
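
The construction of the training table can be sketched as follows. This is not the authors' code: the function names are hypothetical, and taking the most frequent trigrams first among those seen at least 15 times is an assumption, since the paper states only that the 225 trigrams were selected arbitrarily above that frequency.

```python
from collections import Counter

def trigrams(text):
    """Every sequence of three consecutive characters in `text`."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def select_trigrams(spams, min_count=15, limit=225):
    """Trigrams seen at least `min_count` times in the spam sample,
    most frequent first, truncated to `limit` (selection rule assumed)."""
    counts = Counter(t for message in spams for t in trigrams(message))
    frequent = [t for t, c in counts.most_common() if c >= min_count]
    return frequent[:limit]

def encode(message, selected, is_spam):
    """One row of the attribute-value table: the frequency of each selected
    trigram, followed by the class attribute C (1 = Spam, 2 = non-spam)."""
    counts = Counter(trigrams(message))
    return [counts[t] for t in selected] + [1 if is_spam else 2]

sample = ["free money"] * 15 + ["hello"]
selected = select_trigrams(sample)
row = encode("free", selected, is_spam=True)
```

Each message becomes a fixed-length numeric vector regardless of its length, which is the form that the induction methods discussed in the paper expect.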



5 Training Method 

We used the SIPINA method [7]. This method generalizes the notion of decision 
tree, since it leads to induction graphs that are not trees. 

In our study, all attributes are quantitative; the discretization 
procedure relies on a method called Fusbin [7] [8], which aims to determine, 
according to the criteria indicated above, the best cutting point producing a 
bi-partition. 

We should point out that we also used several other strategies, such as CART [11], C4.5 [10], 
Chaid, etc. The goal of this paper is not to compare different methods but to 
show the feasibility of an anti-Spam system combining the concept of trigram 
and data mining techniques. 



6 Results 

The graph generated by SIPINA in the learning phase is given in Figure 1. 
We note that only two trigrams, number 72 ("Ts!") and 
number 184 ("Sgr"), separate the two classes of messages almost completely. 
The rate of correct recognition is 97%. Of 
the 200 Spams, only three were classified in the second class, and of the 
non-spams, only seven were classified in the first class. 
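
The reported rate can be checked directly from the counts given above. The helper below is only a worked arithmetic example; its name and signature are ours.

```python
def recognition_rate(n_spam, n_non_spam, spam_missed, non_spam_flagged):
    """Fraction of messages assigned to their correct class."""
    total = n_spam + n_non_spam
    return (total - spam_missed - non_spam_flagged) / total

rate = recognition_rate(200, 200, spam_missed=3, non_spam_flagged=7)
```

With 10 errors in 400 training messages the rate is 390/400 = 97.5%, consistent with the 97% quoted in the text.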

² We chose 225 trigrams to reduce the computation time. Theoretically, there is no 
limit. 

³ Non-spams were provided by Dr. Jacques Gelinas, from the Decision 
Support Technologies team at the Defence Research Establishment Val-Cartier (Canada), 
whom we wish to thank. 

Fig. 1. Learning phase induction graph. 

The validation file, which contains no message from the training sample, is 
composed of 321 Spams and 279 non-spams. The recognition rate obtained on 
this validation sample is 97%, which confirms the training result. 

7 Spam Miner 

Spam Miner is an intelligent software agent which is able: 

- to find trigrams in a text, 

- to compile these trigrams, 

- to find the 225 trigrams needed for the construction of a data base, 

- to do the data mining (with SIPINA or another algorithm), 

- to construct classification rules for messages from the results of the knowledge 
extraction, 

- to apply the classification rules found. 

During an initial phase of intensive training, the user sorts messages for 
the agent, indicating which messages are Spams and which are not. The agent 
compiles the Spams, builds the list of the 225 trigrams needed for the 
construction of the data base, and then constructs it. Applying the chosen data 
mining algorithm, the agent finds a first set of rules to classify messages. Then 
comes a phase of validation and adjustment. The agent sorts the incoming 
mail and classifies it into two groups by means of a flag: the group 
of Spams and the group of non-spams. The user verifies that the agent classifies 
messages correctly and points out its mistakes. The agent uses these corrections 
to modify its 225 trigrams and to improve its data base and its classification 
rules. Day after day, the agent becomes more effective, the user trusts 
it more and more, and the user has almost no need to worry about the 
classification of messages. 

Spam Miner has a user-friendly graphical user interface and runs under 
Windows. A user guide is presented in the appendix. A copy of the software can be 
obtained free of charge by sending an e-mail to the authors. 
8 Conclusion and Perspectives 

In this work, we tried to show that it is possible to recognize a certain type of 
message as soon as a set of examples is available. Some of the messages that we used 
for the training and validation phases are specific, as is the notion of Spam itself. 
Indeed, this concept can be perceived differently from one user to another and 
can lead to different results, because the training and validation bases would be 
different. The interest of our study lies in the quality of the approach used rather than 
in whether or not the identification rules for Spams are generally applicable. 

The recommended methodology is applicable to other languages using 
UNICODE characters by using the Spam Miner software directly. Moreover, 
the automatic training capability makes the system more universal and more 
robust to change, because it constantly tries to improve itself. Of course, there is 
always room to improve the concept and its performance. This 
technique could also be used for other purposes. For example, the process could 
be used by the search bots available on the Internet [12]. We could also use this 
process to filter Web pages accessed by our children and refuse access if 
their content is not suitable. 
Abstract. There is a wealth of information to be mined from the World 
Wide Web. Unfortunately, standard natural language processing (NLP) 
extraction techniques perform poorly on the choppy, semi-structured 
information fragments, such as sports results, that are now commonly 
published on Web pages. In this paper, we present an 
information agent, SportsFinder, which extracts sports scores from 
the World Wide Web, as well as a knowledge discovery method for 
learning new express patterns to improve the agent's performance. 



1 Introduction 

A wealth of on-line information can be made available to automatic processing 
by information extraction (IE) systems. Each IE application needs a separate 
set of rules tuned to the domain and writing style, which creates a knowledge 
engineering bottleneck [8]. 

This paper examines an alternative: learning the express patterns of on-line 
information and using these patterns to improve the IE system's performance. 
The domain of sports results was chosen for this research because 
it highlights the contrast between the uniformity and diversity of information on 
the Web. The great popularity and appeal of sport mean that very few generalisations can 
be made about people who publish sports results on-line. Consequently, there is 
a wide variety of formats, languages and conventions used on sporting 
Web sites world-wide. At the same time, the uniformity provided by conventions 
and the unambiguous nature of the results make the domain a plausible test bed 
for information extraction. 



2 Agent Architecture 

We present an information agent called SportsFinder, an agent to extract sports 
scores from the Web. The overall agent architecture is shown in Fig. 1. The agent 
is comprised of Knowledge Bases, Web Page Retriever, Game Unit Isolator, Score 
Expert and User Interface. All these components are implemented in Java. 
Fig. 1. SportsFinder's Architecture 



User Interface. It is the interface between the user and the agent. User 
Interface gets the user’s query and provides SportsFinder’s result to the user. If 
the user’s desired sport or competition is not in SportsFinder’s knowledge bases, 
it will provide the dialog for users to add new knowledge to the system. The 
acquired knowledge from the user is added to relevant knowledge bases. 

Knowledge Bases. They store the information needed to extract sports 
scores. There are two knowledge bases in this system: the world knowledge base 
and the domain knowledge base. The world knowledge base contains the infor- 
mation about where the sports results Web site is, and the protocol to contact 
them; the domain knowledge base contains the knowledge of a certain sport 
result, such as the biggest score one team can get in a soccer match. 

Web Page Retriever. The task of the Web Page Retriever is to take the 
URL of the page where the results are stored and connect to that URL. If the 
URL is invalid, an error message is returned to the user. Otherwise, the Web 
Page Retriever fetches the content of that URL and passes it to the Game Unit 
Isolator. 

Definition 1. Information Source: The content of the URL from which 
SportsFinder extracts sports scores is an Information Source, denoted as IS. 

Definition 2. Game Unit: Let $S_t = \langle D_1, D_2, \ldots, D_n \rangle$. If $\exists i$ such that $D_i \in$ Name 
Given $\wedge$ $\exists j, k$ such that $D_j, D_k \in$ Number, then $S_t$ is a Game Unit. 

Game Unit Isolator. The Game Unit Isolator segments the information source into 
game units, which are the candidates from which sports scores are extracted. We 
formalise this process as $IS \to S_1, S_2, \ldots, S_m$. The Game Unit Isolator first looks for 
the user's selected name. It then finds the nearest pair of HTML tags enclosing 
the given name and, if this segment is a Game Unit, extracts its express 
pattern. When the user wants to know the standing of a player, the Game Unit 
Isolator segments the information source into game units according to the extracted 
express pattern. Once a Game Unit is identified, it is passed to the Score Expert. 

Score Expert. The Score Expert extracts and validates the sports scores from 
a Game Unit based on domain knowledge. A Game Unit may contain several 
numbers that could be the game score; the Score Expert heuristically guesses which is most 
likely to be the game score based on the domain knowledge of a given sport. 

3 Algorithm 

3.1 Pattern Scanner 

Instead of a full natural language understanding method, we use express 
patterns to recognise and extract the sports scores. This amounts to a partial 
understanding of the text. Although sports results are very simple, their expressions in 
HTML vary widely. We use HTML tags and some special characters to represent 
the score patterns. For example, the HTML source for a soccer result is: 

<TR> <TD WIDTH=180> ARSENAL </TD> 

<TD WIDTH=50> 5-0 </TD> 

<TD WIDTH=180> BARNSLEY </TD> 

</TR> 

and we define its express pattern as: 

Pattern:: <TR> <TD *> TeamA </TD> 

<TD *> ScoreA - ScoreB </TD> 

<TD *> TeamB </TD> 

</TR> 

The wild card "*" means to skip any number of characters until the next 
occurrence of the following term in the pattern, and "-" is a special character. 
The Pattern Scanner extracts the express pattern of a game unit. 
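
One way to apply such an express pattern is to translate it into a regular expression. The sketch below is written in Python for brevity, whereas SportsFinder's components are implemented in Java; it is not the agent's code. The mapping of the placeholder names to capture groups, and the assumption that scores are runs of digits, are ours.

```python
import re

# Hypothetical mapping from pattern placeholders to capture groups.
FIELDS = {
    "TeamA": r"(?P<TeamA>[^<]+?)",
    "TeamB": r"(?P<TeamB>[^<]+?)",
    "ScoreA": r"(?P<ScoreA>\d+)",
    "ScoreB": r"(?P<ScoreB>\d+)",
}

def pattern_to_regex(pattern):
    """Translate an express pattern into a compiled regular expression."""
    parts = []
    for token in pattern.split():
        if token in FIELDS:
            parts.append(FIELDS[token])
        else:
            # "*" skips characters up to the next literal piece of the token
            parts.append(".*?".join(re.escape(p) for p in token.split("*")))
    return re.compile(r"\s*".join(parts), re.DOTALL | re.IGNORECASE)

def extract(pattern, html):
    """Return the captured fields, or None if the pattern does not match."""
    match = pattern_to_regex(pattern).search(html)
    return match.groupdict() if match else None

PATTERN = ("<TR> <TD *> TeamA </TD> <TD *> ScoreA - ScoreB </TD> "
           "<TD *> TeamB </TD> </TR>")
HTML = """<TR> <TD WIDTH=180> ARSENAL </TD>
<TD WIDTH=50> 5-0 </TD>
<TD WIDTH=180> BARNSLEY </TD>
</TR>"""
result = extract(PATTERN, HTML)
```

Because the tag attributes are matched by the wild card, the same pattern applies to rows whose cell widths differ.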

3.2 Fuzzy Pattern Comparison 

This algorithm calculates the similarity of two patterns. The similarity measure 
used here allows for arbitrary-length deletions and insertions; that is to say, the 
algorithm measures the greatest possible similarity of two patterns under certain 
allowed mismatches and internal deletions. 

Let the two express patterns be $A = a_1 a_2 \cdots a_n$ and $B = b_1 b_2 \cdots b_m$. A 
similarity $s(a, b)$ is given between pattern elements a and b. Deletions of length k 
are given weight $W_k$. To find high degrees of similarity, we set up a matrix H. 
$H_{ij}$ is the maximum similarity of two segments ending in $a_i$ and $b_j$, respectively. 

First set $H_{k0} = H_{0l} = 0$ for $0 \le k \le n$ and $0 \le l \le m$. Then 

$H_{ij} = \max\{H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0\}$ 

for $1 \le i \le n$ and $1 \le j \le m$. 

The formula for $H_{ij}$ is obtained as follows: 

1. If $a_i$ and $b_j$ are associated, the similarity is $H_{i-1,j-1} + s(a_i, b_j)$. 

2. If $a_i$ is at the end of a deletion of length k, the similarity is $H_{i-k,j} - W_k$. 

3. If $b_j$ is at the end of a deletion of length l, the similarity is $H_{i,j-l} - W_l$. 

4. Finally, a zero is included to prevent a calculated negative similarity, indicating 
no similarity up to $a_i$ and $b_j$. 

Definition 3. Possible Similarity: We define a function $PosSim(A, B) \to 
[0, 1]$ to measure the greatest possible similarity of express patterns A and B: 

$PosSim(A, B) = \max_{0 \le i \le n,\ 0 \le j \le m}\{H_{ij}\} \Big/ \Big(\sum_{i=0}^{n}\sum_{j=0}^{m} s(a_i, b_j) - W_{|m-n|}\Big)$ 

If PosSim(A, B) is greater than a threshold, then we consider A and B to be the 
same express pattern. The algorithm for express pattern learning is summarised 
in Fig. 2. 



1. Segment the IS into Game Units. 

2. Scan the Game Unit’s express pattern. 

3. Match the Game Unit’s express pattern with known patterns. If the value of 
PosSim(A, B) is greater than a threshold, the match succeeds; otherwise try another 
known pattern. If no pattern matches, go back to step 2 and select another Game Unit. 

4. Validation. Validate the likely Team Name, query it to the known IS. 

5. Feedback. Present the user the result, if it is wrong try again or learn the pattern 
interactively. Add the recently learned pattern to the base. 



Fig. 2. Algorithm for Learning New Express Patterns 



4 Empirical Results 

Our preliminary testing indicated that SportsFinder is able to extract sports 
scores for a variety of sports with a high success rate. One test was on five 
randomly selected teams in each of nine competitions. In these 45 trials, 
SportsFinder extracted the correct results in 43 of them (about 96%). Most 
impressively, starting with a limited set of known express patterns, SportsFinder 
can improve its performance by learning new express patterns. This makes 
SportsFinder able to deal with the dynamic change of sports sites. Using the 
fuzzy pattern comparison algorithm, SportsFinder can also calculate the position 
of a given team or player in a sports result ladder, such as the results of golf 
and cycling. For example, in Fig. 3, by comparing the express patterns between 
lines, SportsFinder can determine that Vijay Singh’s position in the 1997 NEC 
World Series of Golf Tournament is No. 6. 

Fig. 3. (a) A Golf Tournament Web Page, (b) SportsFinder’s Result. 



5 Conclusion 

As the Web grows dramatically, more and more new contents and express pat- 
terns will be available. SportsFinder is a step towards quick and easy extraction 
of needed information from the Web without having to rely on specialised pro- 
grammers. In our test, SportsFinder can extract an ever-widening diversity of 
types of sports scores by learning new express patterns. 


Abstract. Event Mining discovers information in a stream of data, or 
events, and delivers knowledge in real time. Our event processing engine 
consists of a network of event processing agents (EPAs) running in 
parallel that interact using a dedicated event processing infrastructure. 
EPAs can be configured at run time using a formal pattern language. The 
underlying infrastructure provides an abstract communication mechanism 
and thus allows dynamic reconfiguration of the communication topology 
between agents at run time and provides transparent, location-independent 
access to all data. These features support dynamic allocation of EPAs to 
machines in a local area network at run time. 



1 Introduction 

Event mining (EM) delivers knowledge about a complex system in real-time 
based on events that denote the system’s activities. A system can be anything 
from a single semiconductor fabrication line to the interconnected check-out 
registers of a nation-wide retailer. Such systems may be probed to produce events 
as the system operates. Events are then mined in a multitude of ways: Unwanted 
events are filtered out, patterns of logically corresponding events are aggregated 
into one new complex event, repetitive events are counted and aggregated into 
a new single event with a count of how often the original event occurred, etc. 
This mining process of producing fewer “better” events out of many “lesser” 
events can be iterated. The presentation of the mined events to the user is 
virtually unlimited. EM is particularly well suited for event based systems, but is 
applicable to other systems as well, e.g. updates in a database can be interpreted 
as events. The following two applications are typical examples of EM: 

Business applications: EM based real-time decision support systems con- 
stantly gather information from throughout the enterprise and immediately 
respond to changes in information. These systems are business event driven, 
where a business event represents any significant change in business data or 
conditions. 

Enterprise network and systems management: Event patterns that may 
lead to a failure (e.g. an important disk filling up) or that could signal 
break-in attempts (e.g. connect requests to multiple targets from a single 
source over a short time) are detected as they occur. EM provides immediate 
notification of such conditions to the managers of large, mission-critical 
networks. Automatic prioritizing of alerts and quick root cause analysis lead 
to reduced response time and higher up-time, and allow network managers to 
quickly respond to critical situations. 

In order to understand complex systems and efficiently deal with complex 
patterns of events, logging with just a time stamp is often not enough. The two 
following features greatly increase the power of EM: 

Complex event structures: Events should be stored as complex objects together 
with the relationships among them, instead of just as tuples in a relational 
sense. EM should support event relationships beyond time, e.g. causality: one 
event causes another. In today’s networked real-time environments, events 
come from multiple independent sources and not all events are ordered 
with respect to each other. If such a natural partial order of events is implicitly 
reduced to a total order in logging, information is lost and non-determinism 
is introduced [1]. 

Flexibility: Because EM happens in real time, queries must not be hard coded, 
but must be flexible and configurable at run time. It should be possible at 
any time to start a new query against an ongoing event stream that 
considers only new events, only old events, or both. 

EM supporting these two features is part of Stanford University’s RAPIDE 
project. We developed an extensive set of tools that supports logging, mining, 
storing, and viewing of events in real-time. RAPIDE events are related by time 
and cause. Each relation builds a partial order on all the events. A formal pat- 
tern language [2] supports the construction of filters and maps, constructs that 
aggregate simple events to complex events on a higher level of abstraction [3]. 
The same process can be used to query complex events, thus building a more 
and more abstract view of the system. Our tools are implemented and available 
for Sun/Solaris 2.6 and Linux and can process several hundred events per second 
on an Ultra 1. We are currently negotiating with pilot users in industry. 



2 Event Processing Networks 

The RAPIDE EM technology is based on the concept of Event Processing Networks 
(EPNs). Such networks consist of any number of Event Processing Agents 
(EPAs), namely event sources, event processors and event viewers. Fig. 1 shows 
an overview of the three categories, with thin arrows indicating the (logical) 
flow of events from sources through processors to viewers. 

Fig. 1. Event Mining (EM) 

Event sources in our applications are typically middleware sniffers. The system 
middleware can be pure TCP/IP, an event communication service based on 
a proprietary protocol like TIBCO Inc.’s TIB or Vitria, Inc.’s Communicator, or 
a military standard like MIL STD 1553. We also automatically instrument 
the source code of systems written in Java to intercept events within the Java 
engine [4]. Typical examples of event processors are filters and maps. Filters pass 
on only a subset of their input; maps aggregate multiple events in the input into 
output events, thus generating events on a higher level of abstraction. Any third 
party event processor can be inserted into an EPN, allowing for integration 
with other approaches. Typical event viewers are a graphical viewer for partially 
ordered sets of events, a tabular viewer of event frequency or a simple gauge 
metering the value of an important parameter. 
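
To make the roles of sources, filters, and maps concrete, the following sketch wires three tiny agents together as Python generators. It is not RAPIDE code: the event representation (plain dictionaries), the function names, and the fixed-size aggregation window are all assumptions chosen for illustration.

```python
from collections import Counter


def event_source(records):
    """Source agent: here it simply replays a list of recorded events."""
    yield from records


def event_filter(events, keep):
    """Filter agent: pass on only the events for which keep(event) is true."""
    for event in events:
        if keep(event):
            yield event


def count_map(events, key, window):
    """Map agent: aggregate every `window` input events into one
    higher-level summary event carrying per-key counts.
    A trailing incomplete window is dropped in this sketch."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == window:
            counts = dict(Counter(key(e) for e in batch))
            yield {"type": "Summary", "counts": counts}
            batch = []


def run_network(records):
    """Wire source -> filter -> map and collect what a viewer would show."""
    source = event_source(records)
    no_heartbeats = event_filter(source, lambda e: e["type"] != "Heartbeat")
    summaries = count_map(no_heartbeats, key=lambda e: e["type"], window=2)
    return list(summaries)


records = [
    {"type": "Heartbeat"}, {"type": "DiskWarn"}, {"type": "Login"},
    {"type": "DiskWarn"}, {"type": "Heartbeat"}, {"type": "Login"},
]
for summary in run_network(records):
    print(summary)
```

Generators keep each agent independent of its neighbours, which loosely mirrors the idea that any third-party processor can be inserted into the network.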

Data needs to be stored persistently because agents may want to access past 
events, even long after they have happened. Also the number of objects currently 
under consideration may easily exceed the size of the available main memory, 
thus EM requires some way of storing objects temporarily to disk. RAPIDE EM 
includes a shared data store that keeps track of all the objects. New objects 
are written into the data store from where agents and viewers read them. A 
communication service notifies other EPAs when new objects are added. 

Events flow through the EPN in real time and are displayed in viewers as soon 
as they are created, limited only by the speed of the underlying infrastructure. 
Processed events are displayed in viewers shortly after the underlying events 
have been created by the event source. EPNs are dynamic in that all EPAs 
can be added and removed at run time. Newly added agents can either ignore 
all previous events and start with the current event at the time they are 
added, or they can try to catch up on all events from the beginning. As EPNs are 
distributed, EPAs can reside on machines distributed across a network. 



3 Real Time Pattern Queries 

The RAPIDE pattern language allows the user to describe patterns of events. A 
RAPIDE pattern matcher searches for all occurrences of a pattern of events in a 
partially ordered set. A typical example would search for all A events that cause 
both a B and a C event, with B and C independent of each other. In RAPIDE 
this pattern could be specified as: A → (B ~ C). In OQL, clumsily enhanced 
with a * operator denoting one or several repetitions of the path expression, this 
query would look like: 

select tuple(e.ID, f.ID, g.ID) 
from event e, e.(successor)* f, e.(successor)* g 
where e.type='A' and f.type='B' and g.type='C' and 
((NOT f.(successor)*=g) OR (NOT g.(successor)*=f) OR f=g) 

Executing this query from scratch whenever a new event is added to the set is 
very inefficient, as potentially the whole set has to be traversed. Also, computing 
the complete transitive closure of the successor relation as a derived relationship 
is not feasible in a real-time setting. The algorithms we use instead were originally 
inspired by [5]. Our pattern matching algorithm searches whatever it can on the 
available data and keeps partially completed results around if possible. When 
new data arrives, only the partial results which might possibly benefit from the 
new event need to be reinvestigated. 
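
The idea of keeping partial results and revisiting only those affected by a new event can be sketched for the example pattern (an A event that causes a B and a C, with B and C independent). This is a simplified illustration, not the RAPIDE matcher: the class name `IncrementalMatcher`, its data layout, and the assumption that every event arrives after all of its direct causes are hypothetical. For brevity it also stores, per event, the set of pattern-relevant causal ancestors, which a real-time implementation would probably have to bound.

```python
from collections import defaultdict


class IncrementalMatcher:
    """Finds triples (A, B, C) where A causes B and C, and B, C are independent."""

    RELEVANT = ("A", "B", "C")

    def __init__(self):
        self.kind = {}      # event id -> event type
        self.anc = {}       # event id -> ids of relevant causal ancestors
        # Partial results: for each A event, the B and C effects seen so far.
        self.partial = defaultdict(lambda: {"B": [], "C": []})

    def add(self, eid, etype, causes=()):
        """Insert one event and return the new hits it completes.

        `causes` lists the ids of the direct causes, which must already
        have been added (an assumption of this sketch)."""
        anc = set()
        for c in causes:
            anc |= self.anc[c]
            if self.kind[c] in self.RELEVANT:
                anc.add(c)
        self.kind[eid] = etype
        self.anc[eid] = anc

        hits = []
        if etype in ("B", "C"):
            other = "C" if etype == "B" else "B"
            # Only partial results rooted at an A ancestor of the new
            # event can benefit from it; everything else is left alone.
            for a in sorted(x for x in anc if self.kind[x] == "A"):
                for o in self.partial[a][other]:
                    # The new event cannot cause the older event o, so
                    # independence only requires that o is not an ancestor.
                    if o not in anc:
                        b, c = (eid, o) if etype == "B" else (o, eid)
                        hits.append((a, b, c))
                self.partial[a][etype].append(eid)
        return hits


matcher = IncrementalMatcher()
matcher.add("a1", "A")
matcher.add("b1", "B", causes=["a1"])
print(matcher.add("c1", "C", causes=["a1"]))   # completes a match with b1
print(matcher.add("c2", "C", causes=["b1"]))   # depends on b1, so no match
```

Each insertion touches only the partial results of the A events that causally precede the new event, which is the trade-off between storing and rebuilding partial results that the text describes.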

We call this process incremental query execution or incremental queries. In- 
cremental queries return new hits as new records are inserted and optimize re- 
execution of an ongoing query when new objects are inserted. This optimization 
is a trade off between storing all partial results on the one hand and rebuilding all 
partial results on the other hand. Incremental queries are similar to materialized 
views with real-time constraints. For our purpose, we can think of incremental 
queries as a repository of ongoing queries, along with some state information on 
these queries. Every time a new event is added, the queries of this repository 
would be allowed to run on that element only and the requesting client would 
be notified if there are new hits. Incremental queries require two interdependent 
modules: 

Notification (call-backs, triggers) that notifies interested clients of insertions 
and updates on specific objects in the database, and 
Dynamic Adaptation that modifies the current query execution plan depend- 
ing on the newly inserted object and runs the query. This must be done effi- 
ciently, e.g. the query tree should be executed in such a way that a minimal 
amount of work is redone. We believe that these two elements are useful even 
beyond implementing pattern matching for EM, e.g. for rule processing in 
real-time expert systems [6]. 

The commercial OODBMS that we looked at had very limited support for 
notification: most of them require polling the database for new events. With 
polling, throughput does not scale with the size of the database because the 
time to search for any new object is not constant. One way out is to partition the 
database. However, big partitions do not help much, and small partitions 
increase the number of partitions, which adds the overhead of keeping track of 
them. Worse, references between partitions are slower than references within a 
partition, reducing throughput. In addition, polling has to be done by all 
readers individually, increasing the load in an event processing network. Overall, 
our experiments with polling led to a throughput of only a few objects per second. 
Hence having readers poll the database for new events is unrealistic 
for our purposes. Only one of the commercial OODBMS we looked at offered 
call-backs (triggers) on certain changes in the database, but these were so slow 
that we could only hope to notify a few objects per second. Clearly, efficient 
call-backs with minimal delay are critical. To support our requirements, we added 
our own notification mechanism. 

4 Mining Event Patterns 

Given an infrastructure for building large databases of events and their temporal, 
causal, and data attributes, along with a formal pattern language for expressing 
relationships between events in a compact and expressive way, then event mining 
is the process of extracting patterns from large sets of events in real time. 

Our initial experiments in this area focus on using statistical analysis of 
stored relationships between events (causality, equivalent data parameters) to 
identify common yet complex behaviors implied by the events. 

The patterns extracted via event mining may then be used to initiate further 
event processing. For example, they may be used to filter out normal event 
behavior of a system, so that variations of it may be examined. Or, the patterns 
extracted may be aggregated into higher level events, to allow views of the event 
activity at a more abstract level. 

A critical factor for real-time event mining is the need to process each new 
event in constant time. Otherwise incoming events will eventually start to queue 
up and lead to a large backlog. Heuristic methods that are effective and efficient 
enough are one area of future research. 


Abstract. In this paper, we analyze quantitative measures associated 
with if-then type rules. Basic quantities are identified and many existing 
measures are examined using the basic quantities. The main objective is 
to provide a synthesis of existing results in a simple and unified frame- 
work. The quantitative measure is viewed as a multi-facet concept, repre- 
senting the confidence, uncertainty, applicability, quality, accuracy, and 
interestingness of rules. Roughly, they may be classified as representing 
one-way and two-way supports. 



1 Introduction 

In machine learning and data mining, the discovered knowledge from a large 
data set is often expressed in terms of a set of if-then type rules [7,21]. They 
represent relationships, such as correlation, association, and causation, among 
concepts. Typically, the number of potential rules observable in a large data set 
may be very large, and only a small portion of them is actually useful. In order 
to filter out useless rules, certain criteria must be established for rule selection. 
A common solution is the use of quantitative measures. One may select the 
rules which have the highest values. Alternatively, one may choose a threshold 
value and select rules whose measures are above the threshold value. The well 
known ID3 inductive learning algorithm [23] is an example of the former, and the 
approach for mining association rules in transaction databases [1] is an example 
of the latter. The use of quantitative measures also plays a very important role in 
the interpretation of discovered rules, which provides guidelines for the proper 
uses of the rules. 

Many quantitative measures have been proposed and studied, each of which 
captures different characteristics of rules. However, several important issues need 
more attention. Different names have been used for essentially the same measure, 
or a positive monotonic transformation of the same measure (called order 
preserving transformation [15]). Additional measures are being proposed without 
realizing that the same measures have been studied in related fields such as 
expert systems, pattern recognition, information retrieval, and statistical data 
analysis. The relationships between various measures have not been fully 
investigated. There is clearly a need for a systematic study on the interpretation, 
classification, and axiomatization of quantitative measures associated with rules. 
Important initial studies have been reported by Piatetsky-Shapiro [25], and Major 
and Mangano [17] on the axiomatic characterization of quantitative measures, 
and by Klösgen [15] on the study of special classes of quantitative measures. 

This paper may be viewed as a first step in the study of quantitative mea- 
sures. A simple set-theoretic framework is suggested for interpreting if-then type 
rules. Basic quantities are identified and many existing measures are examined 
using the basic quantities. The results may lay down the groundwork for further 
systematic studies. 

2 The Basic Framework and Basic Quantities 

Consider an if-then type rule of the form: 

$$\text{IF } E \text{ THEN } H \text{ with } \alpha_1, \ldots, \alpha_m, \tag{1}$$

which relates two concepts $E$ and $H$. For clarity, we also simply write $E \to H$. 
A rule does not necessarily represent a strict logical implication, with logical 
implication as the degenerate case. The values $\alpha_1, \ldots, \alpha_m$ quantify different 
types of uncertainty and properties associated with the rule. In principle, one 
may connect any two concepts in the above rule form. The quantities $\alpha_1, \ldots, \alpha_m$ 
measure the degree or strength of relationships [34]. Examples of quantitative 
measures include confidence, uncertainty, applicability, quality, accuracy, and 
interestingness of rules. 

We use the following set-theoretic interpretation of rules. It relates a rule to 
the data sets from which the rule is discovered. Let U denote a finite universe 
consisting of objects. Each object may be considered as one instance of a data 
set. If each object is described by a set of attribute-value pairs, the concepts E 
and H can be formally defined using certain languages, such as propositional 
and predicate languages [15]. We are not interested in the exact representation 
of the concepts. Instead, we focus on the set-theoretic interpretations of 
concepts [13,18,22]. For a concept $E$, let $m(E)$ denote the set of elements of $U$ that 
satisfy the condition expressed by $E$. We also say that $m(E)$ is the set of 
elements satisfying $E$. Similarly, the set $m(H)$ consists of elements satisfying $H$. 
One may interpret $m$ as a meaning function that associates each concept with 
a subset of $U$. The meaning function should obey the following conditions: 

$$\begin{aligned}
m(\neg E) &= U - m(E), \\
m(E \wedge H) &= m(E) \cap m(H), \\
m(E \vee H) &= m(E) \cup m(H),
\end{aligned} \tag{2}$$

representing the sets of elements not satisfying $E$, satisfying both $E$ and $H$, and 
satisfying at least one of $E$ and $H$, respectively. With the meaning function $m$, a 
rule $E \to H$ may be paraphrased as saying that “IF an element of the universe 
satisfies $E$, THEN the element satisfies $H$”. 

Using the cardinalities of sets, we obtain the following contingency table 
representing the quantitative information about the rule $E \to H$: 

           $H$                         $\neg H$                         Totals 
$E$        $|m(E) \cap m(H)|$          $|m(E) \cap m(\neg H)|$          $|m(E)|$ 
$\neg E$   $|m(\neg E) \cap m(H)|$     $|m(\neg E) \cap m(\neg H)|$     $|m(\neg E)|$ 
Totals     $|m(H)|$                    $|m(\neg H)|$                    $|U|$ 

where $| \cdot |$ denotes the cardinality of a set. For clarity, we rewrite the table as 
follows: 

           $H$       $\neg H$   Totals 
$E$        $a$       $b$        $a + b$ 
$\neg E$   $c$       $d$        $c + d$ 
Totals     $a + c$   $b + d$    $a + b + c + d = n$ 


The values in the four cells are not independent. They are linked by the constraint 
$a + b + c + d = n$. The $2 \times 2$ contingency table has been used by many authors 
for representing information of rules [9,11,27,29,33]. From the contingency table, 
we can define some basic quantities. 
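
As a small illustration of the set-theoretic reading, the four cells can be computed directly from the universe and the two meaning sets. The function name `contingency` and the toy universe of ten objects are assumptions made for this example.

```python
def contingency(universe, m_e, m_h):
    """Return the cells (a, b, c, d) for the rule E -> H.

    universe -- the finite set U
    m_e, m_h -- the meaning sets m(E) and m(H), both subsets of U
    """
    universe, m_e, m_h = set(universe), set(m_e), set(m_h)
    not_e = universe - m_e          # m(not E) = U - m(E)
    not_h = universe - m_h          # m(not H) = U - m(H)
    a = len(m_e & m_h)              # satisfy both E and H
    b = len(m_e & not_h)            # satisfy E but not H
    c = len(not_e & m_h)            # satisfy H but not E
    d = len(not_e & not_h)          # satisfy neither
    return a, b, c, d


# Toy data: ten objects, four satisfy E, five satisfy H, two satisfy both.
U = range(10)
a, b, c, d = contingency(U, {0, 1, 2, 3}, {2, 3, 4, 5, 6})
print(a, b, c, d, a + b + c + d)
```

Because the four cells partition the universe, their sum always equals $n = |U|$, which is the constraint stated above.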

The generality of $E$ is defined by: 

$$G(E) = \frac{|m(E)|}{|U|} = \frac{a + b}{n}, \tag{3}$$

which indicates the relative size of the concept $E$. A concept is more general if 
it covers more instances of the universe. If $G(E) = \alpha$, then $(100\alpha)\%$ of objects 
in $U$ satisfy $E$. The quantity may be viewed as the probability of a randomly 
selected element satisfying $E$. Obviously, we have $0 \le G(E) \le 1$. 

The absolute support of $H$ provided by $E$ is the quantity: 

$$AS(H|E) = \frac{|m(H) \cap m(E)|}{|m(E)|} = \frac{a}{a + b}. \tag{4}$$

The quantity, $0 \le AS(H|E) \le 1$, shows the degree to which $E$ implies $H$. If 
$AS(H|E) = \alpha$, then $(100\alpha)\%$ of objects satisfying $E$ also satisfy $H$. It may be 
viewed as the conditional probability of a randomly selected element satisfying $H$ 
given that the element satisfies $E$. In set-theoretic terms, it is the degree to which 
$m(E)$ is included in $m(H)$. Clearly, $AS(H|E) = 1$ if and only if $m(E) \subseteq m(H)$. 
The change of support of $H$ provided by $E$ is defined by: 

$$CS(H|E) = AS(H|E) - G(H) = \frac{a}{a + b} - \frac{a + c}{n}. \tag{5}$$




Unlike the absolute support, the change of support varies from $-1$ to $1$. One 
may consider $G(H)$ to be the prior probability of $H$ and $AS(H|E)$ the posterior 
probability of $H$ after knowing $E$. The difference of posterior and prior 
probabilities represents the change of our confidence regarding whether $E$ actually 
causes $H$. For a positive value, one may say that $E$ causes $H$; for a negative 
value, one may say that $E$ does not cause $H$. The mutual support of $H$ and $E$ is 
defined by: 

$$MS(E, H) = \frac{|m(E) \cap m(H)|}{|m(E) \cup m(H)|} = \frac{a}{a + b + c}. \tag{6}$$

One may interpret the mutual support, $0 \le MS(E, H) \le 1$, as a measure of the 
strength of the double implication $E \leftrightarrow H$. It measures the degree to which $E$ 
causes, and only causes, $H$. The mutual support can be reexpressed by: 



$$MS(E, H) = 1 - \frac{|m(E) \,\Delta\, m(H)|}{|m(E) \cup m(H)|}, \tag{7}$$

where $A \,\Delta\, B = (A \cup B) - (A \cap B)$ is the symmetric difference between two sets. 
The measure $|A \,\Delta\, B| / |A \cup B|$ is commonly known as the MZ metric for measuring 
distance between two sets [19]. Thus, MS may be viewed as a similarity measure 
of $E$ and $H$. 

The degree of independence of $E$ and $H$ is measured by: 

$$IND(E, H) = \frac{G(E \wedge H)}{G(E) G(H)} = \frac{an}{(a + b)(a + c)}. \tag{8}$$

It is the ratio of the joint probability of $E \wedge H$ and the probability obtained 
if $E$ and $H$ are assumed to be independent. One may rewrite the measure of 
independence as [10]: 

$$IND(E, H) = \frac{AS(H|E)}{G(H)}. \tag{9}$$

It shows the degree of the deviation of the probability of $H$ in the subpopulation 
constrained by $E$ from the probability of $H$ in the entire data set [16,31]. 
With this expression, the relationship to the change of support becomes clear. 
Instead of using the ratio, the latter is defined by the difference of $AS(H|E)$ and 
$G(H)$. When $E$ and $H$ are probabilistically independent, we have $CS(H|E) = 0$ 
and $IND(E, H) = 1$. Moreover, $CS(H|E) > 0$ if and only if $IND(E, H) > 1$, 
and $CS(H|E) < 0$ if and only if $IND(E, H) < 1$. This provides further support 
for use of CS as a measure of confidence that $E$ causes $H$. However, CS is not 
a symmetric measure, while IND is symmetric. The difference of $G(H \wedge E)$ and 
$G(H) G(E)$: 

$$D(H, E) = G(H \wedge E) - G(H) G(E), \tag{10}$$

is a symmetric measure. Compared with $D(H, E)$, the measure $CS(H|E)$ may 
be viewed as a relative difference. 

The generality of a concept is related to the probability that a randomly 
selected element will be an instance of the concept. It is the basic quantity from 
which all other quantities can be expressed as follows: 

$$\begin{aligned}
AS(H|E) &= \frac{G(H \wedge E)}{G(E)}, \\
CS(H|E) &= \frac{G(H \wedge E) - G(H) G(E)}{G(E)}, \\
MS(E, H) &= \frac{G(E \wedge H)}{G(E \vee H)}, \\
IND(E, H) &= \frac{G(E \wedge H)}{G(E) G(H)}, \\
D(H, E) &= G(H \wedge E) - G(H) G(E).
\end{aligned} \tag{11}$$



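From the above definitions, we can establish the following relationships: 

$$\begin{aligned}
G(E) &= AS(E|U), \\
CS(H|E) &= (IND(E, H) - 1)\, G(H), \\
AS(H|E) &= \frac{G(H)}{G(E)}\, AS(E|H), \\
MS(E, H) &= \frac{1}{\dfrac{1}{AS(E|H)} + \dfrac{1}{AS(H|E)} - 1}, \\
D(H, E) &= CS(H|E)\, G(E).
\end{aligned} \tag{12}$$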
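
The relationships in (12) can be checked numerically on any contingency table with positive cell counts. The sketch below uses exact rational arithmetic; the function name `basic_quantities`, the dictionary keys, and the example table ($a = 30$, $b = 10$, $c = 20$, $d = 40$) are assumptions for illustration.

```python
from fractions import Fraction


def basic_quantities(a, b, c, d):
    """Basic quantities of the rule E -> H from the cells of the 2x2 table.

    Assumes a, a + b and a + c are all positive."""
    n = a + b + c + d
    g_e = Fraction(a + b, n)        # G(E)
    g_h = Fraction(a + c, n)        # G(H)
    g_eh = Fraction(a, n)           # G(E and H)
    return {
        "G(E)": g_e,
        "G(H)": g_h,
        "AS(H|E)": Fraction(a, a + b),
        "AS(E|H)": Fraction(a, a + c),
        "CS(H|E)": Fraction(a, a + b) - g_h,
        "MS": Fraction(a, a + b + c),
        "IND": g_eh / (g_e * g_h),
        "D": g_eh - g_e * g_h,
    }


q = basic_quantities(30, 10, 20, 40)

# The relationships of (12), checked exactly.
assert q["CS(H|E)"] == (q["IND"] - 1) * q["G(H)"]
assert q["AS(H|E)"] == q["G(H)"] / q["G(E)"] * q["AS(E|H)"]
assert q["MS"] == 1 / (1 / q["AS(E|H)"] + 1 / q["AS(H|E)"] - 1)
assert q["D"] == q["CS(H|E)"] * q["G(E)"]

for name, value in q.items():
    print(name, value)
```

Using `Fraction` rather than floating point avoids rounding noise, so the identities hold with exact equality.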



In summary, all measures introduced in this section have a probability related 
interpretation. They can be roughly divided into three classes: 

generality: G, 

one-way association (single implication): AS, CS, 

two-way association (double implication): MS, IND, D. 

Each type of association measure can be further divided into absolute support 
and change of support. The measure of absolute one-way support is AS, and 
the measure of absolute two-way support is MS. The measures of change of 
support are CS for one-way, and IND and D for two-way. It is interesting to 
note that all measures of change of support are related to the deviation of the joint 
probability of $E \wedge H$ from the probability obtained if $E$ and $H$ are assumed 
to be independent. In other words, a stronger association is present if the 
joint probability is further away from the probability under independence. The 
association can be either positive or negative. 



3 A Review of Existing Measures 

This section is not intended to be an exhaustive survey of quantitative measures 
associated with rules. We will only review some of the measures that fit in the 
framework established in the last section. 

3.1 Generality 

The generality is one of the two standard measures used for mining association 
rules [1]. For rule $E \to H$, the generality: 

$$G(E \wedge H) = \frac{a}{n} \tag{13}$$

is commonly known as the support of the rule. It represents the percentage of 
positive instances of $E$ that support the rule. On the other hand, the generality 
$G(E)$ is the percentage of instances to which the rule can be applied. Iglesia 
et al. [13] called the quantity $G(E)$ the applicability of the rule. Klösgen [15] 
referred to it as a measure of coverage of the concept $E$. 

3.2 One-Way Support 

The absolute support $AS(H|E)$ is the other standard measure used for mining 
association rules [1], called confidence of the rule $E \to H$. Different names were 
given to this measure, including the accuracy [13,29], strength [8,15,26], and 
certainty factor [15]. In the context of information retrieval, the same measure 
is referred to as the measure of precision [32]. Tsumoto and Tanaka [29] used 
the quantity $AS(E|H)$ for measuring the coverage or true positive rate. It is 
regarded as a measure of sensitivity by Klösgen [15]. The same measure was also 
used by Choubey et al. [5]. In the context of information retrieval, the measure is 
referred to as the measure of recall [32]. The use of change of support $CS(H|E)$ 
was discussed by some authors [4,25]. 

Additional measures of one-way support can be obtained by combining basic 
quantities introduced in the last section. Yao and Liu [31] used the following 
quantity for measuring the significance of a rule $E \to H$: 

$$S_1(H|E) = AS(H|E) \log IND(E, H) = \frac{a}{a + b} \log \frac{an}{(a + b)(a + c)}. \tag{14}$$

The measure is a product of a measure of one-way support $AS(H|E)$ and the 
logarithm of a measure of two-way support $IND(E, H)$. Since logarithm is a 
monotonic increasing function, it reflects the properties of $IND(E, H)$. Gray 
and Orlowska [10] proposed a measure of one-way support, called measure of 
rule interestingness, by combining generality and absolute support: 

$$i(H|E) = \left(IND(E, H)^l - 1\right) G(E \wedge H)^m = \left[\left(\frac{an}{(a + b)(a + c)}\right)^l - 1\right] \left(\frac{a}{n}\right)^m, \tag{15}$$

where $l$ and $m$ are parameters to weigh the relative importance of the two 
measures. Klösgen [15] studied another class of measures: 

$$K(H|E) = G(E)^{\alpha} \left(AS(H|E) - G(H)\right). \tag{16}$$

It is a combination of generality and change of support. When $\alpha = 0$, the measure 
reduces to the change of support. 
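
For readers who want to experiment with these combined measures, the sketch below evaluates equations (14) to (16) from the cells of the contingency table. The function name, the dictionary keys, the use of the natural logarithm, and the example table are assumptions; the source does not fix a logarithm base.

```python
import math


def combined_measures(a, b, c, d, l=1, m=1, alpha=1.0):
    """Evaluate S1 (14), i (15) and K (16) for the rule E -> H.

    Assumes a > 0 so that the logarithm is defined."""
    n = a + b + c + d
    g_e, g_h, g_eh = (a + b) / n, (a + c) / n, a / n
    as_he = a / (a + b)                     # AS(H|E)
    ind = g_eh / (g_e * g_h)                # IND(E, H)
    return {
        "S1": as_he * math.log(ind),
        "i": (ind ** l - 1) * g_eh ** m,
        "K": g_e ** alpha * (as_he - g_h),
    }


table = (30, 10, 20, 40)
print(combined_measures(*table))
print(combined_measures(*table, alpha=0.0)["K"])   # change of support CS(H|E)
```

With `alpha=0.0` the value of `K` coincides with the change of support, and with `alpha=1.0` it coincides with the measure D, so the same function covers both special cases.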

Following Duda, Gaschnig, and Hart [6], Kamber and Shinghal [14] and Schlimmer 
and Granger [24] used the measure of logical sufficiency: 

$$LS(H|E) = \frac{AS(E|H)}{AS(E|\neg H)} = \frac{a(b + d)}{b(a + c)}, \tag{17}$$

and the measure of logical necessity: 

$$LN(H|E) = \frac{AS(\neg E|H)}{AS(\neg E|\neg H)} = \frac{c(b + d)}{d(a + c)}. \tag{18}$$


McLeish et al. [20] viewed LS as the weight of evidence if one treats $E$ as a piece 
of evidence. A highly negative weight implies that there is significant reason to 
believe in $\neg H$, and a positive weight supports $H$. It should be pointed out that weight 
of evidence plays an important role in Bayesian inference. Ali, Manganaris and 
Srikant [2] defined the relative risk of a rule $E \to H$ as follows: 

$$r(H|E) = \frac{AS(H|E)}{AS(H|\neg E)} = \frac{a(c + d)}{c(a + b)}. \tag{19}$$

It is in fact related to the measure LS if one exchanges the places of $E$ and $H$. 
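
The closed forms of logical sufficiency, logical necessity, and relative risk, and the remark that relative risk is LS with the roles of $E$ and $H$ exchanged, can be checked with a few lines of exact arithmetic. The function names and the example table are assumptions for illustration; exchanging $E$ and $H$ corresponds to swapping the cells $b$ and $c$.

```python
from fractions import Fraction


def logical_sufficiency(a, b, c, d):
    """LS(H|E) = AS(E|H) / AS(E|not H); assumes b > 0."""
    return Fraction(a, a + c) / Fraction(b, b + d)


def logical_necessity(a, b, c, d):
    """LN(H|E) = AS(not E|H) / AS(not E|not H); assumes d > 0."""
    return Fraction(c, a + c) / Fraction(d, b + d)


def relative_risk(a, b, c, d):
    """r(H|E) = AS(H|E) / AS(H|not E); assumes c > 0."""
    return Fraction(a, a + b) / Fraction(c, c + d)


a, b, c, d = 30, 10, 20, 40

# Closed forms from the contingency table.
assert logical_sufficiency(a, b, c, d) == Fraction(a * (b + d), b * (a + c))
assert logical_necessity(a, b, c, d) == Fraction(c * (b + d), d * (a + c))
assert relative_risk(a, b, c, d) == Fraction(a * (c + d), c * (a + b))

# Exchanging E and H swaps b and c, and turns LS into the relative risk.
assert relative_risk(a, b, c, d) == logical_sufficiency(a, c, b, d)

print(logical_sufficiency(a, b, c, d),
      logical_necessity(a, b, c, d),
      relative_risk(a, b, c, d))
```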

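Based on the probability related interpretation of $AS(H|E)$, Smyth and 
Goodman [28] defined the information content of rules. For $E \to H$, we have: 

$$\begin{aligned}
J(H\|E) &= G(E) \left( AS(H|E) \log \frac{AS(H|E)}{G(H)} + AS(\neg H|E) \log \frac{AS(\neg H|E)}{G(\neg H)} \right) \\
&= G(E) \left( AS(H|E) \log IND(E, H) + AS(\neg H|E) \log IND(\neg H, E) \right) \\
&= \frac{1}{n} \left( a \log \frac{an}{(a + b)(a + c)} + b \log \frac{bn}{(a + b)(b + d)} \right).
\end{aligned} \tag{20}$$

This measure is closely related to the divergence measure proposed by Kullback 
and Leibler [12]. 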



3.3 Two-Way Support 

The measure of independence IND has been used by many authors. Silverstein 
et al. [27] referred to it as a measure of interest. Büchter and Wirth [3] regarded 
it as a measure of dependence. Gray and Orlowska [10] used the same measure, 
and provided the interpretation given by equation (9). 

The measure of two-way support corresponding to equation (14) is given by 
Yao and Liu [31] as: 

$$S_2(E, H) = G(E \wedge H) \log IND(E, H) = \frac{a}{n} \log \frac{an}{(a + b)(a + c)}. \tag{21}$$

By setting $l = m = 1$ in equation (15), we have: 

$$i(H|E) = IND(E, H)\, D(E, H), \tag{22}$$

which is a multiplication of two basic measures of two-way support. By setting 
$\alpha = 1$ in equation (16), we immediately obtain the measure D. 

The measure of two-way support corresponding to the measure of divergence (20) 
is given by the measure of mutual information. For rule $E \to H$, we have: 

$$\begin{aligned}
M(E; H) &= G(E \wedge H) \log \frac{G(E \wedge H)}{G(E) G(H)} + G(E \wedge \neg H) \log \frac{G(E \wedge \neg H)}{G(E) G(\neg H)} \\
&\quad + G(\neg E \wedge H) \log \frac{G(\neg E \wedge H)}{G(\neg E) G(H)} + G(\neg E \wedge \neg H) \log \frac{G(\neg E \wedge \neg H)}{G(\neg E) G(\neg H)} \\
&= \frac{1}{n} \left( a \log \frac{an}{(a + c)(a + b)} + b \log \frac{bn}{(b + d)(a + b)} + c \log \frac{cn}{(a + c)(c + d)} + d \log \frac{dn}{(b + d)(c + d)} \right).
\end{aligned} \tag{23}$$

The relationship between $J$ and $M$ can be established as: 

$$M(E; H) = J(H\|E) + J(H\|\neg E). \tag{24}$$

By extending the above relationship, in general one may obtain measures of 
two-way support by combining measures of one-way support. For example, both 
$AS(H|E) + AS(E|H)$ and $AS(H|E)\,AS(E|H)$ are measures of two-way support. 
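
The decomposition of mutual information into two J-measures can be verified numerically. The sketch below assumes natural logarithms and strictly positive cell counts; the function names and example tables are illustrative. $J(H\|\neg E)$ is obtained from the same formula by exchanging the two rows of the table.

```python
import math


def j_measure(a, b, c, d):
    """J(H||E) in its contingency-table form (natural logarithm).

    J(H||not E) is obtained as j_measure(c, d, a, b)."""
    n = a + b + c + d
    return (a * math.log(a * n / ((a + b) * (a + c)))
            + b * math.log(b * n / ((a + b) * (b + d)))) / n


def mutual_information(a, b, c, d):
    """M(E; H) as a sum over the four cells of the table."""
    n = a + b + c + d
    # (cell count, its row total, its column total)
    cells = [(a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d)]
    return sum(x * math.log(x * n / (row * col)) for x, row, col in cells) / n


a, b, c, d = 30, 10, 20, 40
m_value = mutual_information(a, b, c, d)
j_sum = j_measure(a, b, c, d) + j_measure(c, d, a, b)
assert math.isclose(m_value, j_sum)
print(m_value, j_sum)
```

For a table in which $E$ and $H$ are independent, every logarithm is taken of 1, so both sides of the decomposition vanish.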



3.4 Axioms for Quantitative Measures of Rules 

Piatetsky-Shapiro [25] suggested that a quantitative measure of rule E — > H 
may be computed as a function of G{E), G{H), G{E AH), rule complexity, and 
possibly other parameters such as the mutual distribution of E and H or the 
domain size of E and H. For the evaluation of rules, Piatetsky-Shapiro [25] intro- 
duced three axioms. Major and Mangano [17] added a fourth axioms. 
Klosgen [15] studied a special class of measures that are characterized by two 
quantities, the absolute one-way support AS{H\E) and the generality G{E). 
The generality G{H A E) is obtained by AS{H\E)G{E). Suppose Q{E,H) is a 
measure associated with rule E — > H . The version of the four axioms given by 
Klosgen [15] is: 

(i). Q(E, H) = 0 if E and H are statistically independent,

(ii). Q(E, H) monotonically increases in AS(H|E) for fixed G(E),

(iii). Q(E, H) monotonically decreases in G(E) for fixed G(E ∧ H),

(iv). Q(E, H) monotonically increases in G(E) for fixed AS(H|E) > G(H).

Axiom (i) implies that only measures of change of support are considered. The other
axioms state that all measures must have the property of monotonicity. Many
of the measures discussed in this paper fall into this class. 
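
As a concrete illustration of the four axioms, the sketch below checks them numerically for one well-known measure, Q(E, H) = G(E ∧ H) − G(E)G(H), often attributed to Piatetsky-Shapiro. The choice of this measure, the function names, and the sample probabilities are assumptions made for illustration. The check is a spot test on a few points, not a proof.

```python
def ps_measure(g_e, g_h, g_eh):
    """Q(E, H) = G(E and H) - G(E) G(H)."""
    return g_eh - g_e * g_h


def increasing(values):
    return all(x < y for x, y in zip(values, values[1:]))


def decreasing(values):
    return all(x > y for x, y in zip(values, values[1:]))


def check_axioms(g_h=0.4):
    # (i) independence: G(E and H) = G(E) G(H) gives Q = 0
    ax1 = abs(ps_measure(0.5, g_h, 0.5 * g_h)) < 1e-12
    # (ii) fixed G(E) = 0.5, increasing AS(H|E); G(E and H) = AS(H|E) G(E)
    ax2 = increasing([ps_measure(0.5, g_h, as_he * 0.5)
                      for as_he in (0.1, 0.3, 0.5, 0.7)])
    # (iii) fixed G(E and H) = 0.2, increasing G(E)
    ax3 = decreasing([ps_measure(g_e, g_h, 0.2) for g_e in (0.3, 0.5, 0.7)])
    # (iv) fixed AS(H|E) = 0.6 > G(H) = 0.4, increasing G(E)
    ax4 = increasing([ps_measure(g_e, g_h, 0.6 * g_e) for g_e in (0.2, 0.4, 0.6)])
    return ax1, ax2, ax3, ax4


print(check_axioms())
```

Writing the measure as G(E)(AS(H|E) − G(H)) makes axioms (ii) and (iv) immediate, while the form G(E ∧ H) − G(E)G(H) makes axiom (iii) immediate.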
4 Conclusion

We have presented a simple and unified framework for the study of quantitative 
measures associated with rules. Some basic measures have been proposed and 
studied. Many existing measures have been investigated in terms of these basic 
measures. 

This paper is a preliminary step towards a systematic study on quantitative 
measures associated with rules. Further investigations on the topic are planned.
We will examine the semantics and implications of various measures, and study 
axioms for distinct types of measures. 
Abstract. This paper presents some significant fundamental observations
and/or assumptions on scientific discovery processes and their
automation, shows why classical mathematical logic, its various classi- 
cal conservative extensions, and traditional (weak) relevant logics can- 
not satisfactorily underlie epistemic processes in scientific discovery, 
and presents a strong relevant logic model of epistemic processes in sci- 
entific discovery. 



1 Introduction 

Any scientific discovery must include an epistemic process to gain knowledge of or to 
ascertain the existence of some empirical and/or logical conditionals previously un- 
known or unrecognized. As an applied and/or technical science, Computer Science 
should provide scientists with some epistemic representation, description, reasoning, 
and computing tools for supporting the scientists to suppose, verify, and then ulti- 
mately discover new conditionals in their research fields. However, no programming 
paradigm in the current computer science focuses its attention on this issue. In order 
to provide scientists with a computational method to program their epistemic proc- 
esses in scientific discovery, we are establishing a novel programming paradigm, 
named "Epistemic Programming", which regards conditionals as the subject of com- 
puting, takes primary epistemic operations as basic operations of computing, and 
regards epistemic processes as the subject of programming. 

Modeling epistemic processes in scientific discovery satisfactorily is an indispen- 
sable step to automating scientific discovery processes. This paper presents some 
significant fundamental observations and/or assumptions, which underlie our research 
direction, on scientific discovery processes and their automation, shows why classical 
mathematical logic, its various classical conservative extensions, and traditional 
(weak) relevant logics cannot satisfactorily underlie epistemic processes in scientific 
discovery, and presents a strong relevant logic model of epistemic processes in scientific
discovery as the logical foundation to underlie epistemic programming.



N. Zhong and L. Zhou (Eds.): PAKDD'99, LNAI 1574, pp. 489-493, 1999. 
© Springer-Verlag Berlin Heidelberg 1999




490 Jingde Cheng 



2 Fundamental Observations and/or Assumptions 

First of all, we present here some significant fundamental observations and/or as- 
sumptions, which underlie our research direction, on scientific discovery processes 
and their automation as follows: 

(1) Specific knowledge is the power of a scientist: Any scientist who made a scien- 
tific discovery must have worked in some particular scientific field and more specifi- 
cally on some problem in a particular domain within the field. There is no universal 
scientist who can make scientific discoveries in every field. 

(2) Any scientific discovery has an ordered epistemic process: Any scientific dis- 
covery must have, among other things, a process that consists of a number of ordered 
epistemic activities that may be contributed by many scientists in a long duration. 
No scientific discovery is either an event that occurs in a moment or an accumulation of
disorderly and disorganized inquisitions.

(3) New conditionals are epistemic goals of any scientific discovery: Any scientific 
discovery process must include an epistemic process to gain knowledge of or to as- 
certain the existence of some empirical and/or logical conditionals previously un- 
known or unrecognized. Finding some new data or some new fact is just an initial 
step in a scientific discovery but not the scientific discovery itself.

(4) Scientific reasoning is indispensable to any scientific discovery: Any discovery 
must be unknown or unrecognized before the completion of discovery process. Rea- 
soning is the sole way to draw new conclusions from some premises that are known 
facts and/or assumed hypotheses. There is no scientific discovery that does not invoke
scientific reasoning. 

(5) Scientific reasoning must be justified based on some sound logical criterion: 
The most intrinsic difference between discovery and proof is that discovery has no 
explicitly defined target as its goal. Since any epistemic process in any scientific 
discovery has no explicitly defined target, the sole criterion the epistemic process must 
act according to is to reason correct conclusions from the premises. It is logic that can 
underlie valid scientific reasoning. 

(6) Scientific reasoning must be relevant: For any correct argument in scientific 
reasoning as well as our everyday reasoning, the premises of the argument must be in 
some way relevant to the conclusion of that argument, and vice versa. A reasoning 
including some irrelevant arguments cannot be said to be valid in general. 

(7) Scientific reasoning must be ampliative: A scientific reasoning is intrinsically 
different from a scientific proving in that the purpose of reasoning is to find out some 
facts and conditionals previously unknown or unrecognized, while the purpose of 
proving is to find out a justification for some fact previously known or assumed. A 
reasoning in any scientific discovery must be ampliative such that it enlarges or in- 
creases the reasoning agent’s knowledge in some way. 

(8) Scientific reasoning must be paracomplete: Any scientific theory may be in- 
complete in many ways, i.e., for some sentence ‘A’ neither it nor its negation can be 
true in the theory. Therefore, a reasoning in any scientific discovery must be para- 
complete such that it does not reason out a sentence even if it cannot reason out the 
negation of that sentence. 
(9) Scientific reasoning must be paraconsistent: Any scientific theory may be
inconsistent in many ways, i.e., it may directly or indirectly include some contradiction
such that for some sentence ‘A’ both it and its negation can be true together in the
theory. Therefore, a reasoning in any scientific discovery must be paraconsistent such
that from a contradiction it does not reason out an arbitrary sentence.

(10) Epistemic activities in any scientific discovery process are distinguishable: 
Epistemic activities in any scientific discovery process can be distinguished from 
other activities, e.g., experimental activities, as explicitly described thoughts. 

(11) Normal scientific discovery processes are possible: Any scientific discovery 
process can be described and modeled in a normal way, and therefore, it can be simu- 
lated by computer programs automatically. 

(12) Specific knowledge is the power of a program: Even if scientific discovery 
processes can be simulated by computer programs automatically in general, a particu- 
lar computational process which can certainly perform a particular scientific discovery 
must take sufficient knowledge specific to the subject under investigation into ac- 
count. There is no generally organized order of scientific discovery processes that can 
be applied to every problem in every field. 

(13) Any automated scientific discovery process must be valid: Any automated 
process of scientific discovery must be able to assure us of the truth, in the sense of 
not only fact but also conditional, of the final result produced by the process if it starts 
from an epistemic state where all facts, hypotheses, and conditionals are regarded to 
be true and/or valid. 

(14) Any automated scientific discovery process needs an autonomous forward
reasoning mechanism: Any backward and/or refutation deduction system cannot serve as
an autonomous reasoning mechanism to form and/or discover some completely new 
things. What we need in automating scientific discovery is an autonomous forward 
reasoning system. 



3 The Fundamental Logic to Underlie Epistemic Processes 

Based on the fundamental observations and/or assumptions presented in Section 2, the 
fundamental logic that can underlie epistemic processes has to satisfy some essential 
requirements. First, as a criterion for validity of reasoning, the logic underlying sci- 
entific reasoning in epistemic processes must take the relevance between the premises 
and conclusion of an argument into account. Second, the logic must be able to under- 
lie paracomplete and paraconsistent reasoning; in particular, the principle of 
Explosion that everything follows from a contradiction cannot be accepted by the 
logic as a valid principle. Third, for any set of facts and conditionals, which are con- 
sidered as true and/or valid, given as premises of a reasoning based on the logic, any 
conditional reasoned out as a conclusion of the reasoning must be true and/or valid in 
the sense of conditional. 

Almost all the logic-based works on modeling epistemic processes are based on
classical mathematical logic (CML for short) or some of its classical conservative
extensions [6], keeping as many of the fundamental characteristics of CML as possible. However, CML
cannot satisfy all the above three essential requirements for the fundamental logic.
First, because of the classical account of validity that an argument is valid if and only 
if it is impossible for all its premises to be true while its conclusion is false, a
reasoning based on CML may be irrelevant, i.e., the conclusion reasoned out from the
premises of that reasoning may be entirely irrelevant, in the sense of meaning, to the premises.
Second, CML is of no use for reasoning with inconsistency, since the principle of 
Explosion is a fundamental characteristic of CML. Third, as a result of representing 
the notion of conditional, which is intrinsically intensional, by the extensional notion 
of material implication, CML has a great number of implicational paradoxes as its 
logical axioms or theorems which cannot be regarded as entailments from the view- 
point of scientific reasoning as well as our everyday reasoning. 
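
The following Python sketch illustrates the point about material implication with a brute-force truth table. It is a minimal illustration under stated assumptions: the particular formulas chosen as examples of implicational paradoxes, and all function names, are ours and are not taken from the paper.

```python
from itertools import product


def implies(p, q):
    """Material implication of classical logic: p -> q is (not p) or q."""
    return (not p) or q


def is_tautology(formula, n_vars):
    """True if the formula holds under every truth assignment."""
    return all(formula(*values)
               for values in product([False, True], repeat=n_vars))


# Explosion: (A and not A) -> B, with B unrelated to A
explosion = lambda a, b: implies(a and not a, b)
# A -> (B -> A): a true sentence is "implied" by any sentence
positive_paradox = lambda a, b: implies(a, implies(b, a))
# not A -> (A -> B): a false sentence "implies" any sentence
negative_paradox = lambda a, b: implies(not a, implies(a, b))
# For contrast, a formula that is not a classical tautology
converse = lambda a, b: implies(implies(a, b), implies(b, a))

for name, formula in [("explosion", explosion),
                      ("A -> (B -> A)", positive_paradox),
                      ("not A -> (A -> B)", negative_paradox),
                      ("(A -> B) -> (B -> A)", converse)]:
    print(name, is_tautology(formula, 2))
```

The first three formulas are valid in CML although B need not share any content with A, which is the kind of irrelevance that a logic for scientific reasoning has to exclude.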

Traditional (weak) relevant logics [1, 2] have rejected those implicational para- 
doxes in CML, but still have some ‘conjunction-implicational paradoxes’ and ‘dis- 
junction- implicational paradoxes’ [4] as their logical axioms or theorems, which can- 
not be regarded as entailments from the viewpoint of scientific reasoning as well as 
our everyday reasoning. 

In order to establish a satisfactory logic calculus of conditional to underlie relevant 
reasoning, the present author has proposed some strong relevant logics and shown 
their applications [4, 5]. Since the strong relevant logics are free not only of
implicational paradoxes but also of conjunction-implicational and disjunction-implicational
paradoxes, we can use them to model epistemic processes in scientific discovery without
the problems that arise in modeling epistemic processes by CML, various classical
conservative extensions of CML, and traditional (weak) relevant logics.



4 A Strong Relevant Logic Model of Epistemic Processes 

For a given L-theory with premises P, denoted by T_L(P), and any formula ‘A’ of L, A
is said to be explicitly accepted by T_L(P) if and only if A ∈ P and ¬A ∉ P; A is said to
be explicitly rejected by T_L(P) if and only if A ∉ P and ¬A ∈ P; A is said to be
explicitly inconsistent with T_L(P) if and only if both A ∈ P and ¬A ∈ P; A is said to be
explicitly independent of T_L(P) and is called an explicitly possible new premise for T_L(P)
if and only if both A ∉ P and ¬A ∉ P. For any given formal theory T_L(P) and any
formula A ∉ P, A is said to be implicitly accepted by T_L(P) if and only if A ∈ T_L(P) and
¬A ∉ T_L(P); A is said to be implicitly rejected by T_L(P) if and only if A ∉ T_L(P) and
¬A ∈ T_L(P); A is said to be implicitly inconsistent with T_L(P) if and only if both
A ∈ T_L(P) and ¬A ∈ T_L(P); A is said to be implicitly independent of T_L(P) and is called
an implicitly possible new premise for T_L(P) if and only if both A ∉ T_L(P) and
¬A ∉ T_L(P).
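
The four explicit cases above depend only on membership of A and ¬A in the premise set P, so they can be sketched directly. The snippet below is a minimal illustration, not part of the model itself: formulas are plain strings, negation is written with a hypothetical "~" prefix, and the function name is an assumption. The implicit cases would additionally require the deductive closure T_L(P), which is not modeled here.

```python
def explicit_status(formula, premises):
    """Classify a formula against a premise set P by explicit membership only."""
    has_formula = formula in premises
    has_negation = ("~" + formula) in premises  # purely syntactic negation
    if has_formula and not has_negation:
        return "explicitly accepted"
    if not has_formula and has_negation:
        return "explicitly rejected"
    if has_formula and has_negation:
        return "explicitly inconsistent"
    return "explicitly independent"


# Hypothetical premise set used only for illustration.
P = {"p", "~q", "r", "~r"}
for a in ("p", "q", "r", "s"):
    print(a, explicit_status(a, P))
```

In this toy premise set, "s" is an explicitly possible new premise because neither it nor its negation occurs in P.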

Let K ⊆ F(EcQ), where F(EcQ) is the set of formulas of predicate relevant
logic EcQ, be a set of sentences to represent known knowledge and/or current beliefs
of an agent. For any A ∈ T_EcQ(K) − K where T_EcQ(K) ≠ K, an epistemic deduction of A
from K, denoted by K^{d+A}, by the agent is defined as K^{d+A} =df K ∪ {A}; for any
A ∉ T_EcQ(K), an explicitly epistemic expansion of K by A, denoted by K^{e+A}, by the
agent is defined as K^{e+A} =df K ∪ {A}; for any A ∈ K, an explicitly epistemic contraction
of K by A, denoted by K^{−A}, by the agent is defined as K^{−A} =df K − {A}; for any
A ∉ T_EcQ(K), an implicitly epistemic expansion of K by A, denoted by T_EcQ(K^{+A}), is
defined as T_EcQ(K^{+A}) =df T_EcQ(K ∪ Δ) where Δ ⊆ F(EcQ) such that A ∉ K ∪ Δ but
A ∈ T_EcQ(K ∪ Δ); for any A ∈ T_EcQ(K), an implicitly epistemic contraction of K by A,
denoted by T_EcQ(K^{−A}), is defined as T_EcQ(K^{−A}) =df T_EcQ(K − Δ) where Δ ⊆ F(EcQ) such
that A ∉ T_EcQ(K − Δ); a simple induction by the agent is an epistemic expansion such
that for ∃x(A) ∈ K and ∀x(A) ∉ T_EcQ(K), do K^{e+∀x(A)}; a simple abduction by the agent is
an epistemic expansion such that for B ∈ K, (A ⇒ B) ∈ K, and A ∉ T_EcQ(K), do K^{e+A}. The
basic properties of these epistemic operations can be found in [5].

An epistemic process of an agent is a sequence K_0, o_1, K_1, o_2, K_2, ..., K_{n−1}, o_n, K_n
where K_i ⊆ F(EcQ) (n ≥ i ≥ 0), called an epistemic state of the epistemic process, is a set
of sentences to represent known knowledge and/or current beliefs of the agent, and
o_{i+1} (n > i ≥ 0) is any of the primary epistemic operations, and K_{i+1} is the result of applying
o_{i+1} to K_i. An epistemic process K_0, o_1, K_1, ..., o_n, K_n is said to be consistent if and
only if T_EcQ(K_i) is consistent for any i (n ≥ i ≥ 0); an epistemic process K_0, o_1, K_1, ...,
o_n, K_n is said to be inconsistent if and only if T_EcQ(K_i) is consistent but T_EcQ(K_j) is
inconsistent for all j > i; an epistemic process K_0, o_1, K_1, ..., o_n, K_n is said to be
paraconsistent if and only if T_EcQ(K_i) is inconsistent but T_EcQ(K_j) is consistent for some j > i;
an epistemic process K_0, o_1, K_1, ..., o_n, K_n is said to be monotonic if K_i ⊆ K_j for any i < j;
an epistemic process K_0, o_1, K_1, ..., o_n, K_n is said to be nonmonotonic if K_i ⊄ K_j for
some i < j. Any epistemic process K_0, o_1, K_1, ..., o_n, K_n including an epistemic
contraction must be nonmonotonic.
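
The notion of an epistemic process as an alternating sequence of states and operations, and the monotonicity condition on its states, can be sketched as follows. This is an illustrative simplification under stated assumptions: states are finite sets of formula strings, only the explicit expansion and explicit contraction are modeled, no relevant-logic deduction is performed, and all names are hypothetical.

```python
def expand(state, formula):
    """Explicit expansion: add the formula to the state."""
    return state | {formula}


def contract(state, formula):
    """Explicit contraction: remove the formula from the state."""
    return state - {formula}


def run_process(initial_state, operations):
    """Apply the operations in order and return the states K_0, ..., K_n."""
    states = [frozenset(initial_state)]
    for operation, formula in operations:
        states.append(frozenset(operation(states[-1], formula)))
    return states


def is_monotonic(states):
    """Monotonic if every earlier state is a subset of every later state."""
    return all(states[i] <= states[j]
               for i in range(len(states))
               for j in range(i + 1, len(states)))


growing = run_process({"p"}, [(expand, "q"), (expand, "r")])
revised = run_process({"p"}, [(expand, "q"), (contract, "p")])
print(is_monotonic(growing), is_monotonic(revised))
```

The second process contains a contraction of a formula that was present in an earlier state, so an earlier state is not contained in a later one and the process is nonmonotonic.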

The idea to model epistemic processes in scientific discovery using relevant logic 
rather than classical mathematical logic was first proposed in 1994 by the present 
author [3]. Other work by the author in this direction and a comparison with related
work can be found in [5]. 

References 

1. Anderson, A. R., Belnap Jr., N. D.: Entailment: The Logic of Relevance and Neces- 
sity. Vol. I. Princeton University Press (1975) 

2. Anderson, A. R., Belnap Jr., N. D., Dunn, J. M.: Entailment: The Logic of Rele- 
vance and Necessity. Vol. II. Princeton University Press (1992) 

3. Cheng, J.: A Relevant Logic Approach to Modeling Epistemic Processes in Scien- 
tific Discovery. Proc. 3rd Pacific Rim International Conference on Artificial Intelli- 
gence. Vol. 1. (1994) 444-450 

4. Cheng, J.: The Fundamental Role of Entailment in Knowledge Representation and 
Reasoning. Journal of Computing and Information. Vol. 2. No. 1 (1996) 853-873 

5. Cheng, J.: A Strong Relevant Logic Model of Epistemic Processes in Scientific 
Discovery. Working Notes of ECAI-98 Workshop on Machine Discovery (1998) 20- 
29 

6. Gärdenfors, P., Rott, H.: Belief Revision. In: Gabbay, D. M., Hogger, C. J., Robinson,
J. A. (eds.): Handbook of Logic in Artificial Intelligence and Logic Programming.
Vol. 4. Epistemic and Temporal Reasoning. Oxford University Press (1995) 35-132




Discovering Conceptual Differences among 
Different People via Diverse Structures 



Tetsuya Yoshida, Teruyuki Kondo, and Shogo Nishida



Dept. of Systems and Human Science
Grad. School of Eng. Science, Osaka Univ. 

1-3 Machikaneyama-cho, Toyonaka, Osaka 560-8531, Japan 



Abstract. We extend a method for discovering conceptual differences 
among people by introducing diverse structures utilizing Genetic Algo- 
rithm (GA). In general different people seem to have different ways of 
conception and thus can have different concepts even on the same thing. 
Removing conceptual differences seems especially important when people 
with different backgrounds and knowledge carry out collaborative works 
as a group; otherwise they cannot communicate ideas and establish mu- 
tual understanding even on the same thing. In our approach knowledge 
from users is structured into decision trees so that differences in con- 
cepts can be discovered as the differences in the structure of trees. In 
our previous approach the ID3 algorithm was utilized to construct a single
decision tree based on information theory. However, it has the problem
that conceptual differences which are not represented in the tree due to
low information gain cannot be dealt with. To solve this problem,
this paper proposes a new method for discovering conceptual differences 
which utilizes diverse structures via GA. Experiments were carried out 
on motor diagnosis cases with artificially encoded conceptual differences 
and the result shows the superiority of introducing diverse structures 
with GA to a single decision tree which is constructed with ID3. 



1 Introduction: Discovering Conceptual Difference 

It is required to support collaborative works with the participation of various 
people by extending conventional information processing technologies in accordance
with the need for dealing with large-scale accumulated cases. In addition,
the importance of facilitating interdisciplinary collaboration among people with 
different backgrounds has been recognized these days. As for supporting collab- 
orative works among people, various researches have been carried out in the field 
of CSCW (Computer Supported Cooperative Work) [1]. 

We aim at supporting mutual understanding among people when they collab- 
oratively work as a group. In this paper we focus on dealing with “Conceptual 
Difference” at the symbol level. Usually different symbols are used to denote
different concepts; however, the same symbol can be used to denote different
concepts depending on the viewpoint of people and the context in which the 
symbol is used. In contrast, different symbols can be used to represent the same 
concept. These can occur especially among people with different backgrounds
and knowledge. Conceptual differences dealt with in this paper are defined as 
follows: 

— Type 1: different symbols are used to denote the same concept. 

— Type 2: the same symbol is used to denote different concepts. 

We have proposed a system for discovering conceptual differences among people 
based on the cases provided by users [3,6]. In that system concepts held by users 
are structured into a concrete representation of decision trees and the system 
points out the possibility of conceptual differences based on the structure of 
decision tree. By representing the knowledge of each user as a single decision tree,
the system could discover conceptual differences with high probability. However,
several problems were found due to the use of a single decision tree for each user.

The ID3 algorithm [4] has been utilized to construct decision trees since it is
fast and thus is suitable for interactive systems. The system architecture which
incorporates the discovering method is shown in Fig. 1. By accepting the cases
as input the system constructs decision trees for them and tries to discover
conceptual differences in attributes, values and classes based on the structural
differences in trees. Since there are 2 types of conceptual differences for 3 entities,
the system tries to discover 6 kinds of conceptual differences and shows the
candidates for them to users in descending order of possibility. Based
on the result from the system, users discuss with each other to change their concepts
toward reducing conceptual differences and modify input data to the system. The
above processes are repeated interactively to remove conceptual differences gradually.
In the future we plan to extend the system so that it is applicable to more than two users.



[Figure: Input File A and Input File B (each with classes, attributes, values, examples); Mixture; Decision Tree Producing Module; Conceptual Difference Detecting Module; Candidates for Conceptual Difference]

Fig. 1. System Architecture
2 Discovery Method Based on Diverse Structures



The ID3 algorithm constructs only one fixed structure for an input file. Thus, it
sometimes occurs that the attributes and the values with conceptual difference
do not appear in decision trees. Since conceptual difference is discovered based on
the structure of trees, the system with the ID3 algorithm cannot discover conceptual
difference which does not appear in decision trees explicitly. If decision trees with
diverse structures for an input file are constructed, the possibility of discovering
conceptual difference increases and thus the above problem might be solved.
Since GA (Genetic Algorithm) can carry out efficient or convenient search by
keeping or improving the quality of decision trees [5], we apply GA to construct
various types of decision trees and to increase the possibility of finding out the
appropriate conceptual difference by keeping diverse structures.

The tree structure is used to represent the genetic information in order to 
reflect the structure of decision trees in our approach. The nodes except the 
leaves in the representation contain the position in the decision tree and the 
attribute to judge the branch in the decision tree. The branch in the genetic 
representation contains the value, which indicates the branch to follow in the 
decision tree. 

In our approach, three operators, crossover, mutation and selection, are used
in GA. Crossover is used to exchange the partial trees as in Figure 2. The same
attribute might appear more than once on the path from the root to the leaf as
the result of crossover. Since the nodes with the same attribute except the first
one are meaningless, these are removed as in Figure 3. Moreover, leaf nodes with
no case data are removed since they do not contribute to the classification.
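
The cleanup of repeated attributes after crossover might be sketched as follows. This is a minimal sketch, not the authors' implementation: the nested-dict tree encoding, the function name `prune_repeated`, and the rule of replacing a repeated node by the branch for the value already fixed higher on the path are all assumptions made for illustration. It also assumes that every internal node has a child for every value of its attribute.

```python
def prune_repeated(tree, fixed=None):
    """Remove nodes that test an attribute already tested higher on the path.

    A tree is either a class label (leaf) or a dict
    {"attr": name, "children": {value: subtree}}.
    """
    fixed = fixed or {}
    if not isinstance(tree, dict):  # leaf: nothing to prune
        return tree
    attr = tree["attr"]
    if attr in fixed:
        # Only the branch for the value chosen higher up can be reached,
        # so the repeated node is replaced by that branch.
        return prune_repeated(tree["children"][fixed[attr]], fixed)
    return {
        "attr": attr,
        "children": {value: prune_repeated(sub, {**fixed, attr: value})
                     for value, sub in tree["children"].items()},
    }


# Hypothetical tree in which "noise" is tested twice on one path.
after_crossover = {
    "attr": "noise",
    "children": {
        "yes": {"attr": "noise", "children": {"yes": "fault", "no": "ok"}},
        "no": "ok",
    },
}
print(prune_repeated(after_crossover))
```

Keeping the first occurrence and collapsing the later ones matches the statement that nodes with the same attribute except the first one are meaningless.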







Fig. 2. Crossover by exchanging partial trees.

Fig. 3. Remove redundant nodes on one path



The set of decision trees which should survive to the next generation needs
to have the ability to classify examples efficiently and to have diverse structures
as described above. Therefore, selection is carried out under the following two
indexes: error rate and mutual distance. When examples are classified into a
single class efficiently in each leaf, error rate gets smaller, so a smaller value in this
index is better. When the structures in one set of decision trees are more diverse,
mutual distance among decision trees gets larger, and a larger value in this index
is judged as better. Decision trees with smaller error rate survive in the first
stage, and mutual distance is calculated for each set of decision trees which is
constructed as a combination of the surviving trees. Finally the set of decision trees with
the largest mutual distance becomes the initial population in the next generation.
For further details on our algorithm, please refer to [2].
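
The two-stage selection described above can be sketched in a few lines. The sketch is illustrative only and rests on several assumptions that are not given in the text: the error rate and the distance between two trees are supplied by the caller, the mutual distance of a set is taken to be the sum of pairwise distances, and the parameters `n_survivors` and `set_size` are hypothetical.

```python
from itertools import combinations


def select_population(trees, error_rate, distance, n_survivors, set_size):
    """Stage 1: keep the trees with the smallest error rate.
    Stage 2: among all sets of the survivors, keep the most diverse one."""
    survivors = sorted(trees, key=error_rate)[:n_survivors]

    def mutual_distance(group):
        # Assumed definition: sum of pairwise distances within the set.
        return sum(distance(s, t) for s, t in combinations(group, 2))

    return max(combinations(survivors, set_size), key=mutual_distance)


# Toy stand-ins: each "tree" is a string of tested attributes.
trees = ["abc", "abd", "abq", "xyz", "zzz"]
errors = {"abc": 0.10, "abd": 0.12, "abq": 0.15, "xyz": 0.20, "zzz": 0.90}
hamming = lambda s, t: sum(x != y for x, y in zip(s, t))

chosen = select_population(trees, errors.get, hamming, n_survivors=4, set_size=2)
print(chosen)
```

Enumerating all combinations is only practical for small populations; a larger population would need a heuristic in the second stage.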



2.1 Experiment and Evaluation 

A prototype system has been developed on a UNIX workstation in the C
language. As an example, the motor diagnosis case was evaluated. In this case, two
persons gave their knowledge in the form of thirty examples, which were
composed of six attributes, two or three values and five classes, respectively.
Artificial conceptual differences, including ones that could not be discovered by the
system with the ID3 algorithm alone, were given to the system to evaluate the
ability to discover conceptual difference. GA was carried out until the hundredth
generation.

Experiments were carried out under the condition that two kinds of conceptual
differences occurred at the same time in the test cases. As the quantitative
evaluation, the number of discoveries and the probability of discovery up to the third
candidate were collected in the experiments both for the system with ID3 and
that with GA. The experiments showed that the candidates which had higher
possibility of expressing conceptual difference were intensified and that noisy
candidates were restrained by employing diverse structures. These results, shown
in Table 1 and Table 2, confirmed that the performance improves by adopting GA
as the decision tree construction algorithm.
Table 1. Result with ID3.

conceptual   number of   1st   2nd   3rd   probability of
difference   trials                        discovery
C1           20          20    0     0     100%
C2                       18    0     0     90%
C1           30          17    1     0     60%
A2                       5     3     7     50%
C1           30          30    0     0     100%
V2                       52    0     0     87%
C2           30          22    0     1     77%
V1                       24    4     0     93%
A1           30          12    13    3     93%
A2                       6     3     7     53%
A1           30          12    11    7     100%
V2                       45    2     6     88%


Table 2. Result with GA.

conceptual   number of   1st   2nd   3rd   probability of
difference   trials                        discovery
C1           20          20    0     0     100%
C2                       18    0     0     90%
C1           30          30    0     0     100%
A2                       19    6     2     90%
C1           30          30    0     0     100%
V2                       52    8     0     100%
C2           30          23    1     0     80%
V1                       30    0     0     100%
A1           30          14    10    4     93%
A2                       20    8     2     100%
A1           30          20    9     1     100%
V2                       38    22    0     100%


3 Conclusion 

This paper has proposed a method to improve the discovery of conceptual differ- 
ences among people from cases by utilizing multiple decision trees with diverse 
structures. Experiments were carried out for our previous system with the ID3
algorithm and for the system with the proposed method, and the result on motor
diagnosis cases showed that the performance on discovery was improved
compared with our previous approach. Further experiments are to be carried out to
clarify the characteristics of our approach as well as to tune the parameters used 
in Genetic Algorithm toward improving the performance of the system. 
Abstract. When attempting to discover by learning concepts embedded
in data, it is not uncommon to find that information is missing from
the data. Such missing information can diminish the confidence in the
concepts learned from the data. This paper describes a new approach to 
fill missing values in examples provided to a learning algorithm. A deci- 
sion tree is constructed to determine the missing values of each attribute 
by using the information contained in other attributes. Also, an ordering 
for the construction of the decision trees for the attributes is formulated. 
Experimental results on three datasets show that completing the data 
by using decision trees leads to final concepts with less error under dif- 
ferent rates of random missing values. The approach should be suitable 
for domains with strong relations among the attributes, and for which 
improving accuracy is desirable even if computational cost increases. 



1 Introduction 

Machine learning techniques have been successfully employed to extract concepts 
embedded in data describing instances from a particular domain. When the 
instances are described by attributes and propositions on attribute values for 
each instance, this type of learning is called propositional learning. Algorithms 
already exist that manage to build concepts from the above type of data [8,2]. 

One troublesome aspect of data sets used in machine learning is the oc- 
currence of unknown attribute values for some instances in the available data. 
The missing values phenomenon is likely to occur after generating by-products on
different data collections, which is an operation commonly carried out during 
the process of knowledge discovery [5] . When missing values occur in the data, 
the learning algorithm fails to find an accurate representation of the concept 
(e.g., decision trees or rules). Properly filling missing values in data can help 
in reducing the error rate of the learned concepts. Thus, the purpose of this 
paper is to introduce and evaluate a mechanism to fill missing values in the data 
employed by a propositional learning algorithm. 

This paper is organized as follows. First, the approach for estimating missing 
values is explained, then experimental results on several datasets are discussed, 
followed by a survey of related work. Finally, suitable domains, restrictions, and 
further improvements are discussed. 
2 Estimating Missing Values



Having established the need to fill missing values in the data of a particular 
domain, it is advisable to make the most efficient use of information already 
available in the data. That is to say, it seems worthwhile to design a method 
that uses the maximum amount of information derivable from the data, while 
at the same time holding computational demands to a sustainable level. In this 
section, a new method for filling missing values is described in terms of how 
these two requirements have been met. 

In order to fulfill the first requirement, decision trees are constructed for each 
attribute by using a reduced training set with only those examples that have 
known values for the attribute. The reason for using decision trees is that they 
are suitable for representing relations among the most important attributes when 
determining the value of a target attribute. In addition, decision tree learning 
algorithms are fast at formulating accurate concepts.

After constructing a decision tree for filling the missing values of an attribute, 
it makes sense to use the data with filled values in order to construct a deci- 
sion tree for filling the missing values of other attributes. Therefore, the order 
followed when constructing attribute trees and filling the missing values per at- 
tribute becomes important. The ordering proposed here is based on the concept 
in Information Theory called mutual information, which has been successfully 
used as a criterion for attribute selection in decision tree learning [8].

Mutual Information between two ensembles X and Y is defined by: 



$$H(X; Y) = \sum_{x \in A_X} P(x) \log \frac{1}{P(x)} - \sum_{y \in A_Y} P(y) \sum_{x \in A_X} P(x|y) \log \frac{1}{P(x|y)} \qquad (1)$$



Mutual information measures the average reduction in uncertainty about X 
that results from learning the value of Y, or vice versa. If the attributes and the
class are treated as ensembles, then, by measuring mutual information among
attributes and class, inferences can be drawn about the strength of the relations
between them.

In propositional learning, attributes that have low mutual information with 
respect to the class have less chance to participate in the final concept, so that 
properly filling the missing values for such attributes will have very low impact 
on the accuracy of the final concept. In contrast, attributes having high mutual 
information with respect to the class have a higher chance of being incorporated 
into the final concept, making it worthwhile trying to obtain a finer filling of 
their missing values. 

Considering the previous discussion and the requirement of holding computational
demands to a sustainable level, the ordering proposed in this approach
can be expressed as follows. Let C be the class variable. First, construct
decision trees and fill the missing values for the attributes that have low mutual
information with respect to C, then proceed to attributes with progressively
higher mutual information. When constructing the decision tree for any particular
attribute $A_i$, discard from the training set those attributes $A_k$ for which
$H(A_i; C) < H(A_k; C)$.
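
The ordering described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the implementation used in
the experiments: mutual information is estimated from observed frequencies,
instances with a missing value are skipped when estimating it, and the names
`mutual_information` and `attribute_tree_plan` are hypothetical. The sketch
stops at the plan, that is, the order in which attribute trees would be built
and the attributes each tree may use as input; the tree learner itself is
omitted.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate H(X;Y) in bits from paired samples.

    Pairs in which either value is missing (None) are skipped (an assumption).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(c / n * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def attribute_tree_plan(columns, klass):
    """Return [(attribute, input_attributes)] in the order the trees are built.

    columns: dict mapping attribute name -> list of values (None = missing)
    klass:   list of class labels, aligned with the columns
    """
    mi = {name: mutual_information(values, klass) for name, values in columns.items()}
    order = sorted(mi, key=mi.get)  # lowest mutual information with C first
    plan = []
    for a in order:
        # discard every A_k with H(A_i;C) < H(A_k;C); keep the rest as inputs
        inputs = [b for b in order if b != a and mi[b] <= mi[a]]
        plan.append((a, inputs))
    return plan


columns = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],      # determines the class
    "b": [0, 1, 0, 1, 0, 1, 0, 1],      # independent of the class
    "c": [0, 0, 1, 1, 0, 1, 1, None],   # partly informative, one missing value
}
klass = [0, 0, 1, 1, 0, 0, 1, 1]
plan = attribute_tree_plan(columns, klass)
print(plan)
```

In this toy data the tree for `b` is built first with no inputs, and the tree
for `a` is built last and may use both `b` and `c`, whose missing values have
already been filled by then.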
3 Experiments and Discussion

The experimental focus of this paper is to compare the accuracy of decision 
tree learning from data sets whose missing values have been filled by different 
methods. In the experiments, the proposed attribute trees method is compared 
with two other methods used in machine learning for dealing with missing values: 
the majority method [4] and the probabilistic method [3]. 

Ten-fold cross validation experiments were carried out for each of the three 
methods. Each experiment was conducted for rates of artificial missing values 
ranging from 10% to 70%. Artificial missing values were generated in identi- 
cal proportions and following the same distribution for each attribute. Figure 1 
shows a summary of the characteristics of the three datasets used for the evalu- 
ations. 



name       instances   attr     class
Soybean    307         35 cat   19
BreastCW   699         9 num    2
Mushroom   8124        21 cat   2

Fig. 1. Summary of Datasets




Fig. 2. Results on Soybean (horizontal axis: missing value rate)

Fig. 3. Results on BreastCW (horizontal axis: missing value rate)

Fig. 4. Results on Mushroom (horizontal axis: missing value rate)



When looking at the effects of missing data, the most reasonable assumption 
is that future data will contain the same proportion and kinds of missing values 
as the present data [2]. Accordingly, the experiments conducted in this study 
included artificial missing values in identical proportion and distribution in both 
training and test data. 

Figures 2, 3, and 4 plot the average classification error for concepts learned
on each of the target domains with each of the three methods. The
difference in error between the attribute trees method and the other
two methods was found to be significant at the 95% confidence level for all tested
rates of missing values in all data sets. These results indicate that when missing
values occur in both the training and test instances, the attribute trees method
is superior at modeling the missing values in the three domains tested.

The worst performance was obtained with the majority method, as expected,
since this method cannot be used for filling missing values in the test
data. In contrast, the probabilistic and attribute trees methods are more
complete in the sense that they can deal with missing values in both training and
test data.

The probabilistic method constructs a model of the missing values, which de- 
pends only on the prior distribution of the attribute values for each attribute 
being tested in a node of the tree. This approach is adequate when most of the 
attributes are independent, so that the model can rely on the values of each 
attribute, without regard to the other attributes. Attribute trees are more
complete models because they can represent complex relations among the attributes,
which appear when there is high dependency among the attributes. 



4 Related Work 

In propositional learning, one of the first approaches for dealing with missing 
values was to ignore instances with missing data [8]. This approach was soon
found to be weak because it fails to exploit the useful information present in
instances with some missing attribute values. Thus, a method that considered
the most frequent attribute value as a good candidate for filling the missing value
was proposed and later extended to take the most frequent attribute value for
the class of the instance that has the missing value [4]. This approach is referred
to as the majority method.
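
As a small illustration, the class-conditional variant of the majority method
might look as follows in Python. This is a minimal sketch, assuming instances
are stored as dictionaries and that `None` marks a missing value; the function
name `majority_fill` is hypothetical and does not come from [4].

```python
from collections import Counter, defaultdict


def majority_fill(rows, attr, class_attr):
    """Fill missing values of `attr` with the most frequent known value
    among the instances that share the same class label."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[attr] is not None:
            counts[row[class_attr]][row[attr]] += 1
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the input is left unchanged
        if row[attr] is None and counts[row[class_attr]]:
            row[attr] = counts[row[class_attr]].most_common(1)[0][0]
        filled.append(row)
    return filled


rows = [
    {"color": "red", "cls": "a"},
    {"color": "red", "cls": "a"},
    {"color": "blue", "cls": "a"},
    {"color": None, "cls": "a"},
    {"color": "green", "cls": "b"},
    {"color": None, "cls": "b"},
]
filled = majority_fill(rows, "color", "cls")
```

Because the fill depends on the class label of the instance, the sketch also
makes visible why this method cannot be applied to test instances, whose class
is unknown.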

Another approach is to assign all possible values for the attribute, each weighted
by its prior probability estimated from the known distribution of the values of
the attribute [3]. This approach is referred to as the probabilistic method, and it
was borrowed for the implementation of C4.5 [7]. In fact, Quinlan chose the
probabilistic approach after extensive experimentation on several domains [6],
comparing the three methods mentioned above and a fourth method using a
decision tree for each attribute. The approach presented in this paper differs from
that fourth method tested by Quinlan in two ways. First, here the attribute trees are
constructed following an ordering, and second, only the attributes with lower
mutual information with respect to the class are taken as input for the construction
of a tree for a particular attribute.

On the statistics side of research on decision tree learning, the surrogate
splits method was formulated by Breiman [2] in his work on binary regression
trees. This method always keeps secondary attributes to be tested at each node
of the decision tree in case the value of the primary attribute is
missing. In fact, this method can be viewed as a specific case of the more general
approach of using decision trees to fill the missing values of the attributes [6].

5 Concluding Remarks

A method for obtaining missing values has been proposed and successfully tested 
on several data sets. On the tested domains, the new method is seen to provide 
significantly better performance than the two methods currently used to deal 
with missing values in propositional learning. Domains with high dependency 
among the attributes are thought to be the most suitable for application of the 
approach introduced in this paper. 

All the datasets tested here have discrete values for their attributes. This 
restriction follows from the nature of the decision tree learner used to construct 
the attribute trees. Further experimentation on using a decision tree learner that 
can deal with continuous classes is advisable. 

The increase in computational cost was not evaluated here. The approach
is thought to be suitable for domains in which the reduction in classification
error justifies an increase in computational cost.

Abstract. This paper presents an algorithm for discovering prediction
rules with dynamic bias selection. A prediction rule, which is aimed at 
predicting the class of an unseen example, deserves special attention due 
to its usefulness. However, little attention has been paid to the dynamic 
selection of biases in prediction rule discovery. A dynamic selection of 
biases is useful since it reduces humans’ burden of choosing and adjust- 
ing multiple mining algorithms. In this paper, we propose a novel rule 
discovery algorithm D^BiS, which is based on a data-driven criterion. 
Our approach has been validated using 17 data sets. 



1 Introduction 

Currently, most of the methods for prediction rule discovery have a fixed com- 
bination of biases. A constructive induction algorithm such as [4] is not always 
appropriate in prediction rule discovery, since the applicability of the hypothesis 
is not considered in learning from examples. Morik proposed a multistrategy 
rule discovery system [6], but its goal is the characterization of a data set rather
than the discovery of prediction rules. In order to cope with this problem, we 
propose D^BiS, a rule discovery system based on dynamic bias selection. This 
selection is based on an evaluation criterion of class-attributes dependency. Pre- 
liminary results with 15 data sets from the UCI repository [5] and 2 real-world 
data sets [7] are promising. 

2 Discovery of Prediction Rules 

2.1 Problem Description 

Consider a training data set $D_{L0}$ with $n$ examples, each of which is expressed
with $m$ attributes. First, every continuous attribute, if any, is converted to a
nominal attribute using a discretization method [1], and we obtain a discretized
training data set $D_L$. An event representing, in propositional form, a single
value assignment to an attribute will be called an atom. We define a prediction
rule $y \to x$ as a production rule whose premise $y$ is represented by a logical
expression of atoms and whose conclusion $x$ is a single atom.

In D^BiS, we consider the problem of finding a set of $K$ rules
$R = \{r_1, r_2, \ldots, r_K\}$ with a fixed conclusion $x$ from $D_L$. In this
problem, an algorithm is evaluated according to the set of rules it discovers in
terms of a test data set $D_{T0}$. The discretized test data set $D_T$ is obtained
by applying to $D_{T0}$ the same discretization that was employed for the
training data set. The proportions of examples that an atom $z$ covers in the
discretized training data set and the discretized test data set are represented by
$\Pr(z)$ and $\Pr_T(z)$ respectively.

In evaluating the goodness of a rule set $R$, we consider two indices: the
applicability $\Pr_T(Y)$ of a rule set, which is the proportion of examples that
the premises of the rule set $R$ cover in $D_T$, and the accuracy $\Pr_T(x \mid Y)$
of a rule set, which is the proportion of correct predictions for these examples.
When an example in $D_T$ is covered by two or more rules in $R$, the rule with
the smallest subscript is selected for its prediction.

The J-measure [8] is a single criterion which evaluates both applicability and
accuracy and has various desirable properties. In this paper, we employ the
J-measure $J_T(R)$ of a rule set $R$ in a discretized test data set $D_T$ as the
evaluation criterion of $R$, where $\bar{x}$ represents the negation of $x$.

\[
J_T(R) = \Pr_T(x, Y) \log_2 \frac{\Pr_T(x \mid Y)}{\Pr_T(x)}
       + \Pr_T(\bar{x}, Y) \log_2 \frac{\Pr_T(\bar{x} \mid Y)}{\Pr_T(\bar{x})}
\tag{1}
\]
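
To make the criterion concrete, the J-measure above can be computed from three
quantities: the prior probability of the conclusion, the applicability, and the
accuracy. The following Python function is a minimal sketch; the name
`j_measure` and its argument names are illustrative, and the convention
0 log 0 = 0 is assumed.

```python
from math import log2


def j_measure(p_x, p_y, p_x_given_y):
    """J-measure in bits.

    p_x:         prior probability of the conclusion x (must satisfy 0 < p_x < 1)
    p_y:         proportion of examples covered by the premise (applicability)
    p_x_given_y: proportion of covered examples with conclusion x (accuracy)
    """
    def term(posterior, prior):
        # 0 * log(0) is taken to be 0 by convention
        return 0.0 if posterior == 0 else posterior * log2(posterior / prior)

    # Pr(x, y) = Pr(y) * Pr(x | y), and likewise for the negation of x
    return p_y * (term(p_x_given_y, p_x) + term(1 - p_x_given_y, 1 - p_x))


perfect = j_measure(p_x=0.5, p_y=0.4, p_x_given_y=1.0)
useless = j_measure(p_x=0.5, p_y=0.4, p_x_given_y=0.5)
```

A rule set that covers 40% of the examples and is always correct on a balanced
conclusion scores 0.4 bits here, while one whose accuracy merely equals the
prior scores 0, which shows how the measure rewards applicability and accuracy
jointly.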



2.2 Biases in Rule Discovery 



In this paper, a discovery task is viewed as a search problem, in which a node of
a search tree represents a rule r. The premise of a rule contains
no atoms in a node of depth 1, and each time the depth increases by 1, an atom is
added to the premise of the rule. The rules which have the K highest values of
an evaluation criterion are output as a rule set. D^BiS employs depth-first
search up to a fixed depth, and beam search beyond that depth.

First, for the evaluation criterion, we consider the predictiveness and the
J-measure of a rule. The predictiveness of a rule $y \to x$ is defined as the
conditional probability $\Pr(x \mid y)$ in the discretized training data set. When
two or more rules have the same predictiveness, they are evaluated according to
their $\Pr(y)$. The J-measure [8] $J_L(r)$ of a rule $r : y \to x$ is defined as
follows.

\[
J_L(r) = \Pr(x, y) \log_2 \frac{\Pr(x \mid y)}{\Pr(x)}
       + \Pr(\bar{x}, y) \log_2 \frac{\Pr(\bar{x} \mid y)}{\Pr(\bar{x})}
\tag{2}
\]



Second, for the rule representation, we consider a conjunction rule and an M- 
of-N rule. A conjunction rule is a production rule whose premise is a conjunction 
of atoms. On the other hand, an M-of-N rule is a production rule whose premise 
becomes true when more than M of its N atoms become true. 

Third, for the discretization method, we consider the equal frequency
method [1] and the minimum entropy method of [2]. The equal frequency method,
when the number of bins is $k$, divides $n$ examples so that each bin contains $n/k$
(possibly duplicated) adjacent values. The minimum entropy method discretizes
examples so that the class entropy is minimized. In [2], the number of bins is
automatically determined.
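
The equal frequency method can be illustrated with the short Python sketch
below. It is a simplified reading of the description above, not the procedure
of [1]: cut points are placed at the midpoint between neighbouring sorted
values (an assumption), repeated cut points caused by duplicated values are
dropped, and the function names are hypothetical.

```python
from bisect import bisect_right


def equal_frequency_cuts(values, k):
    """Return at most k - 1 cut points so that each bin holds about n/k values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for b in range(1, k):
        i = b * n // k  # index of the first value of bin b
        if 0 < i < n:
            cut = (ordered[i - 1] + ordered[i]) / 2  # midpoint (an assumption)
            if not cuts or cut > cuts[-1]:           # skip duplicated cut points
                cuts.append(cut)
    return cuts


def bin_index(value, cuts):
    """Map a continuous value to the index of its bin."""
    return bisect_right(cuts, value)


cuts = equal_frequency_cuts(range(1, 13), k=3)
bins = [bin_index(v, cuts) for v in (4, 5, 9)]
```

With the twelve values 1 to 12 and three bins, the sketch places the cut points
so that each bin receives four values.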

3 Data-Driven Selection of Biases 

The biases presented in the previous section can be divided into two groups: 
simple biases and complex biases. Predictiveness, an M-of-N rule and the equal 
frequency method belong to simple biases, while other biases are complex biases. 
If the class can be easily predicted, a simple bias would be more effective than 
a complex one. Therefore, we select simple biases when the data set is easy. 

Now the problem is to evaluate the difficulty of a data set. We have modified
the dependency between the class and the other attributes used in COBWEB [3].
The modified dependency $\mathit{Dep}'$ of Eq. (3) is employed in D^BiS, where $A_i$
and $v_{ij}$ represent an attribute and an attribute value respectively.



First, consider the selection of the evaluation criterion. Generally, predictiveness
is superior in finding rules with large $\Pr(x \mid y)$, and the J-measure of a rule in
finding rules with large $\Pr(y)$. When the modified dependency $\mathit{Dep}'$ is large,
predictiveness is considered to be a good criterion since it is relatively easy to
find rules with large $\Pr(y)$. When the modified dependency is small,
predictiveness tends to find rules whose premises each cover a small number of
examples. Therefore, D^BiS employs predictiveness when the modified
dependency is smaller than a given evaluation threshold $\theta_e$, and the J-measure
of a rule otherwise.

Second, consider the selection of the rule representation. With respect to the
applicability $\Pr_T(Y)$ of a rule set, an M-of-N rule usually beats a conjunction rule
since each rule has large $\Pr(y)$. On the other hand, a conjunction rule is superior
to an M-of-N rule in the accuracy $\Pr_T(x \mid Y)$ of a rule set since each rule has
large $\Pr(x \mid y)$. When the modified dependency $\mathit{Dep}'$ is large, this
superiority seems to decrease since each rule has large $\Pr(y)$. Therefore, D^BiS
employs an M-of-N rule when the modified dependency is large, and a conjunction
rule otherwise. For this judgement, we have found that the attribute which has the
largest dependency on the class plays an important role. We introduce the modified
dependency $\mathit{Dep}(A_i)'$ for an attribute $A_i$.



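\[
\mathit{Dep}' = \frac{1}{m} \sum_{i=1}^{m} \sum_{j} \Pr(A_i = v_{ij})
  \left( \Pr(x \mid A_i = v_{ij})^2 - \Pr(x)^2 \right)
\tag{3}
\]

\[
\mathit{Dep}(A_i)' = \sum_{j} \left( \Pr(x \mid A_i = v_{ij})^2 - \Pr(x)^2
  + \Pr(\bar{x} \mid A_i = v_{ij})^2 - \Pr(\bar{x})^2 \right)
\tag{4}
\]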



D^BiS employs an M-of-N rule if $\max_i \mathit{Dep}(A_i)' > \theta_{r1} \mathit{Dep}'$,
where $\theta_{r1}$ is a representation threshold 1, or if $\mathit{Dep}'$ is larger
than a given representation threshold 2 $\theta_{r2}$. Otherwise, a conjunction rule
is employed unless predictiveness is employed as the evaluation criterion. This
constraint is due to the fact that the combination of these 2 biases tends to
produce rules whose $\Pr(y)$ is small.

Third, consider the selection of the discretization method. D^BiS tries the 
equal frequency method with 2 to 10 bins and the minimum entropy method, 
then selects the method with the maximum Dep' . Here, the equal frequency 
method tends to have a large number of bins since it does not consider class 
information. For a similar reason as above, D^BiS always employs the minimum 
entropy method when predictiveness is employed as the evaluation criterion. 



4 Experimental Evaluation 

The effectiveness of D^BiS has been evaluated with 17 data sets, from which 47
discovery tasks were defined. The data sets used in the experiments are: hepatitis,
servo, wine, imports-85, heart, housing, australian, diabetes, vehicle, tic-tac-toe,
segmentation, allbp, allhyper, allhypo and mushroom from the UCI repository [5];
kiosk and horse from [7]. For a data set with a discrete class, we have chosen,
as a conclusion, every class which covers more than 1% of all examples.
For “housing” and “servo”, the continuous class attributes are first discretized
with the equal frequency method into 5 bins, and the bins with the largest
and the smallest values are chosen as conclusions. The same method was used for
“imports-85”, where the “price” and “city-mpg” attributes are each considered as a
class attribute.

We employed a combination of depth-first search up to depth 3, and beam
search with beam width 5 for further depth in the experiments. Five-fold cross
validation was used in estimating the J-measure of the rule sets. Concerning the
parameters in section 3, we used $K = 5$, $\theta_e = 0.04$, $\theta_{r1} = 0.9$ and
$\theta_{r2} = 0.008$.

In the experiments, we observed that the combinations of biases selected
by D^BiS were frequently almost as good as their respective best combination.
In terms of the J-measure of a rule set, 30 of the 47 combinations selected by
D^BiS achieved more than 90% of the value of their respective best combination.
Therefore, D^BiS selected the best or a nearly best combination of biases in
about 2/3 of the cases.

We investigated this performance of D^BiS in terms of the number of examples
covered by the conclusion, and the results are shown in Table 1. In the table,
the second column represents the ratio of the J-measure of the rule set discovered
by D^BiS to the J-measure of the rule set discovered with the best combination.
The third column represents the average performance of the bias combinations.

From the table, D^BiS comes close to the best combination, with ratios from 0.79
to 1.00, and far outperforms the average case, whose ratios range from 0.65 to
0.73. The performance of D^BiS, unlike the average case, tends to increase
as the number of examples increases. We attribute this to its analysis of biases
in terms of the dependency between the class and the other attributes.

Table 1. Relative performance $J_T(R)$ of D^BiS and the average case to the
best case with respect to the number of examples in the conclusion

# of examples (tasks)   D^BiS   Average
0 - 99 (16)             0.79    0.67
100 - 199 (8)           0.94    0.68
200 - 299 (5)           0.92    0.65
300 - 399 (10)          0.92    0.73
500 - (8)               1.00    0.67


5 Conclusions 

This paper has described a novel approach for discovering prediction rules using 
dynamic selection of biases. The selection is based on the analysis of the depen- 
dency between the class and the other attributes. An evaluation criterion, which 
is a modification of class-attributes dependency [3], is employed for this purpose. 
Experimental results with 47 discovery tasks show that our system is effective 
for the discovery of prediction rules.

Abstract. We present a comparison of three entropy-based discretization
methods in a context of learning classification rules. We compare
the binary recursive discretization with a stopping criterion based on
the Minimum Description Length Principle (MDLP) [3], a non-recursive
method which simply chooses a number of cut-points with the highest 
entropy gains, and a non-recursive method that selects cut-points accord- 
ing to both information entropy and distribution of potential cut-points 
over the instance space. Our empirical results show that the third method 
gives the best predictive performance among the three methods tested. 



1 Introduction 

Recent work on entropy-based discretization of continuous attributes has pro- 
duced positive results [2, 6] . One promising method is Fayyad and Irani’s binary 
recursive discretization with a stopping criterion based on the Minimum Descrip- 
tion Length Principle (MDLP) [3]. The MDLP method is reported as a successful 
method for discretization in the decision tree learning and Naive-Bayes learn- 
ing environments [2, 6]. However, little research has been done to investigate 
whether the method works well with other rule induction methods. We report 
our performance findings of the MDLP discretization in a context of learning 
classification rules. The learning system we use for experiments is ELEM2 [1], 
which learns classification rules from a set of training data by selecting the 
most relevant attribute-value pairs. We first compare the MDLP method with 
an entropy-based method that simply selects a number of entropy-lowest cut- 
points. The results show that the MDLP method fails to find sufficient useful 
cut-points, especially on small data sets. The experiments also discover that the 
other method tends to select cut-points from a small local area of the entire value 
space, especially on large data sets. To overcome these problems, we introduce a 
new entropy-based discretization method that selects cut-points based on both 
information entropy and distribution of potential cut-points. Our conclusion is 
that MDLP does not give the best results in most tested datasets. The proposed 
method performs better than MDLP in the ELEM2 learning environment. 

2 The MDLP Discretization Method 

N. Zhong and L. Zhou (Eds.): PAKDD’99, LNAI 1574, pp. 509-514, 1999.

Given a set $S$ of instances, an attribute $A$, and a cut-point $T$, the class
information entropy of the partition induced by $T$, denoted as $E(A, T; S)$, is
defined as

\[
E(A, T; S) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2),
\]

where $S_1$ and $S_2$ are the two subsets of $S$ induced by $T$ and $Ent(S_i)$ is
the class entropy of the subset $S_i$, defined as

\[
Ent(S_i) = -\sum_{j=1}^{k} P(C_j, S_i) \log(P(C_j, S_i)),
\]

where there are $k$ classes $C_1, \ldots, C_k$ and $P(C_j, S_i)$ is the proportion of
examples in $S_i$ that have class $C_j$. For an attribute $A$, the MDLP method
selects a cut point $T_A$ for which $E(A, T_A; S)$ is minimal among all the boundary
points¹. The training set is then split into two subsets by the cut point.
Subsequent cut points are selected by recursively applying the same binary
discretization method to each of the newly generated subsets until the following
condition is achieved:



\[
Gain(A, T; S) \le \frac{\log_2(N - 1)}{N} + \frac{\Delta(A, T; S)}{N},
\]

where $N$ is the number of examples in $S$, $Gain(A, T; S) = Ent(S) - E(A, T; S)$,
$\Delta(A, T; S) = \log_2(3^k - 2) - [k\,Ent(S) - k_1 Ent(S_1) - k_2 Ent(S_2)]$, and
$k$, $k_1$ and $k_2$ are the number of classes represented in the sets $S$, $S_1$
and $S_2$, respectively.
Empirical results, presented in [3], show that the MDLP stopping criterion leads 
to construction of better decision trees. Dougherty et al. [2] also show that
a global variant of the MDLP method significantly improves a Naive-Bayes
classifier and performs best among several discretization methods in the
context of C4.5 decision tree learning.
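
The stopping criterion can be checked for a single candidate split with the
Python sketch below. It assumes the standard form of the criterion from [3]
and base-2 logarithms for the entropies; the function names are hypothetical,
and both sides of the split are assumed to be non-empty.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Class entropy Ent(S) in bits of a non-empty list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def mdlp_accepts(left, right):
    """Return True if the split into `left` and `right` class labels passes
    the MDLP test, i.e. the recursion continues on this cut point."""
    s = left + right
    n = len(s)
    e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    gain = entropy(s) - e
    k, k1, k2 = len(set(s)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * entropy(s) - k1 * entropy(left) - k2 * entropy(right))
    # the recursion stops when gain <= (log2(N - 1) + delta) / N
    return gain > (log2(n - 1) + delta) / n


clean_split = mdlp_accepts([0] * 10, [1] * 10)
noisy_split = mdlp_accepts([0, 0, 1], [1, 1, 0])
```

A split that separates two pure groups of ten examples is accepted, whereas a
split of six examples that barely reduces the entropy is rejected. The second
case hints at the behaviour discussed later in the paper: with few examples,
the right-hand side of the condition is large and the recursion tends to stop
early.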



3 Experiments with MDLP Discretization and ELEM2 

We conducted experiments with two versions of ELEM2. Both versions employ 
the entropy-based discretization method, but with different stopping criteria. 
One version uses the global variant of the MDLP discretization method, i.e., 
it discretizes continuous attributes using the recursive entropy-based method 
with the MDLP stopping criterion applied before rule induction begins. The 
other version uses the same entropy criterion for selecting cut-points before rule 
induction, but it simply chooses at most $m$ entropy-lowest cut-points
without recursive application of the method. $m$ is set to
$\max\{2, k \cdot \log_2 l\}$, where $l$ is the number of distinct observed values
for the attribute being discretized and $k$ is the number of classes. We refer to
this method as Max-m.
Both versions first sort the examples according to their values of the attribute 
and then evaluate only the boundary points in their search for cut-points. 
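
The Max-m selection can be sketched in Python as follows. This is an
illustration under several assumptions rather than the ELEM2 implementation:
candidate cut points are midpoints between adjacent distinct values, a midpoint
counts as a boundary point unless both neighbouring values carry the same
single class, $m$ is truncated to an integer, and all function names are
hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def boundary_points(values, labels):
    """Midpoints between adjacent distinct values whose examples differ in class."""
    classes_at = defaultdict(set)
    for v, c in zip(values, labels):
        classes_at[v].add(c)
    distinct = sorted(classes_at)
    points = []
    for lo, hi in zip(distinct, distinct[1:]):
        # not a boundary only if both values carry the same single class
        if len(classes_at[lo] | classes_at[hi]) > 1:
            points.append((lo + hi) / 2)
    return points


def partition_entropy(values, labels, cut):
    """Class information entropy E(A, T; S) of the binary partition at `cut`."""
    left = [c for v, c in zip(values, labels) if v <= cut]
    right = [c for v, c in zip(values, labels) if v > cut]
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)


def max_m_cuts(values, labels):
    """Select at most m boundary points with the lowest partition entropy."""
    k = len(set(labels))
    l = len(set(values))
    m = max(2, int(k * log2(l)))  # truncation to an integer is an assumption
    ranked = sorted(boundary_points(values, labels),
                    key=lambda t: partition_entropy(values, labels, t))
    return sorted(ranked[:m])


values = [1, 2, 3, 4, 5, 6, 7, 8]
one_cut = max_m_cuts(values, [0, 0, 0, 0, 1, 1, 1, 1])
many_cuts = max_m_cuts(values, [0, 1, 0, 1, 0, 1, 0, 1])
```

In the first call only one boundary point exists, so a single cut point is
returned. In the second call there are seven boundary points but $m = 6$, so
one of them is dropped.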

We first conducted experiments on an artificial data set. Each example
in the data set has two continuous attributes and a symbolic attribute. The
two continuous attributes, named A1 and A2, have value ranges of [0,90] and
[0,5], respectively. The symbolic attribute, color, takes one of four values:
red, blue, yellow and green. An example belongs to class “1” if the following
condition holds: (30 < A1 < 60) ∧ (1.5 < A2 < 3.5) ∧ (color = blue or green);
otherwise, it belongs to class “0”. The data set has a total of 9384 examples. We
randomly chose 6 training sets from these examples. The sizes of the training sets
range from 47 examples (0.5%) to 4692 examples (50%). We ran the two versions
of ELEM2 on each of the 6 training sets to generate a set of decision rules. The
rules were then tested on the original data set of 9384 examples.

Table 1 shows, for all the training sets, the predictive accuracy, the total number
of cut-points selected for both continuous attributes, the total number of rules
generated for both classes, and the number of boundary points for both continuous
attributes. The results indicate that, when the number of training examples is
small, the MDLP method stops too early and fails to find enough useful cut-points,
which causes ELEM2 to generate rules that have poor predictive performance on the
testing set. When the size of the training set increases, MDLP generates more
cut-points and its predictive performance improves. For the middle-sized training
set (470 examples), MDLP works perfectly because it finds only 5 cut-points
from 97 boundary points, which include all four of the correct cut-points that the
learning system needs to generate correct rules. However, when the training set
becomes larger, the number of cut-points MDLP finds increases greatly. In the
last training set (4692 examples), it selects 73 cut-points out of 97 potential
points, which slows down the learning system. In contrast, the Max-m method
is more stable. The number of cut-points it produces ranges from 14 to 22 and its
predictive performance is better than that of MDLP when the training set is small.

Training   Predictive accuracy   No. of cut-points   No. of rules     No. of
set size   MDLP      Max-m       MDLP     Max-m      MDLP    Max-m    boundary points
47         56.71%    95.20%      0        14         3       6        58
188        90.41%    100%        2        21         4       6        96
470        100%      100%        5        22         6       6        97
1877       100%      100%        29       22         6       6        97
4692       100%      100%        73       22         6       6        97

Table 1. Results on the Artificial Domain.

We also ran the two versions of ELEM2 on a number of actual data sets obtained
from the UCI repository [4], each of which has at least one continuous attribute.
Table 2 reports the ten-fold evaluation results on 6 of these data sets.

¹ Fayyad and Irani proved that the value $T_A$ that minimizes the class entropy
$E(A, T_A; S)$ must always be a value between two examples of different classes in
the sequence of sorted examples. These kinds of values are called boundary points.

4 Discussion 

The empirical results presented above indicate that MDLP is not superior to
Max-m on most of the tested data sets. One possible reason is that, when the
training set is small, the examples are not sufficient to make the MDLP criterion
valid and meaningful, so that the criterion causes the discretization process to
stop too early, before producing useful cut-points. Another possible reason is
that, even though the recursive MDLP method is applied to the entire instance
space to find the first cut-point, it is applied “locally” in finding subsequent
cut-points due to the recursive nature of the method. Local regions represent
smaller samples of the instance space, and the estimation based on small samples
using the MDLP criterion may not be reliable.

Data set   Number of   Predictive accuracy   Average no. of rules
           examples    MDLP      Max-m       MDLP     Max-m
bupa       345         57.65%    66.93%      4        65
german     1000        68.30%    68.50%      107      100
glass      214         63.14%    68.23%      31       30
heart      270         81.85%    82.59%      48       30
iris       150         95.33%    96.67%      8        7
segment    2310        95.76%    90.65%      67       99

Table 2. Results on the Actual Data Sets.

Given that MDLP does not seem to be a good discretization method for
ELEM2, is Max-m a reliable method? A close examination of the cut-points
produced by Max-m for the segment data set reveals that, for several attributes,
the selected cut-points concentrate in a small local area of the entire value space.
For example, for an attribute that ranges from 0 to 43.33, Max-m picks 64
cut-points, all of which fall between 0.44 and 4, even though there are many
boundary points lying outside this small area. This problem is caused by the way
the Max-m method selects cut-points. Max-m first selects the cut-point that has
the lowest entropy value, then selects as the next point the point with the
second lowest entropy, and so on. This strategy may result in a large number of
cut-points being selected near the first cut-point because their entropy values are
closer to the entropy value of the first cut-point than the entropy values of the
cut-points located far from the first cut-point. The cut-points located in a small
area around the first cut-point offer very little additional discriminating power
because the difference between them and the first cut-point involves only a few
examples. In addition, since only the first m cut-points are selected by Max-m,
selecting too many cut-points in a small area may prevent the algorithm from
choosing the promising points in other regions.



5 A Revised Max-m Discretization Method 

To overcome the weakness of the Max-m method, we propose a new entropy-based
discretization method by revising Max-m. The new method avoids selecting
cut-points within only one or two small areas. It chooses cut-points
according to both information entropy and the distribution of boundary
points over the instance space. The method is referred to as EDA-DB
(Entropy-based Discretization According to Distribution of Boundary points).
Similar to Max-m, EDA-DB selects at most $m$ cut-points, where $m$ is defined
as in the Max-m method. However, rather than taking the first $m$ entropy-lowest
cut-points, EDA-DB divides the value range of the attribute into intervals and
selects in each interval $i$ a number $m_i$ of cut-points based on the entropy
calculated over the entire instance space. $m_i$ is determined by estimating the
probability distribution of the boundary points over the instance space. The
EDA-DB discretization algorithm is described as follows. Let $l$ be the number of
distinct observed values for a continuous attribute $A$, $b$ be the total number of
boundary points for $A$, and $k$ be the number of classes in the data set. To
discretize $A$,

1. Calculate $m$ as $\max\{2, k \cdot \log_2(l)\}$.

2. Estimate the probability distribution of boundary points:

(a) Divide the value range of $A$ into $d$ intervals, where $d = \max\{1, \log(l)\}$.

(b) Calculate the number $b_i$ of boundary points in each interval $iv_i$, where
$i = 1, 2, \ldots, d$ and $\sum_{i=1}^{d} b_i = b$.

(c) Estimate the probability of boundary points in each interval $iv_i$
($i = 1, 2, \ldots, d$) as $p_i = b_i / b$.

3. Calculate the quota $q_i$ of cut-points for each interval $iv_i$
($i = 1, 2, \ldots, d$) according to $m$ and the distribution of boundary points as
follows: $q_i = p_i \cdot m$.

4. Rank the boundary points in each interval $iv_i$ ($i = 1, 2, \ldots, d$) by
increasing order of the class information entropy of the partition induced by the
boundary point. The entropy for each point is calculated globally over the entire
instance space.

5. For each interval $iv_i$ ($i = 1, 2, \ldots, d$), select the first $q_i$ points
in the above ordered sequence. A total of $m$ cut-points are selected.
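
The five steps above can be sketched in Python as follows. The sketch is an
illustration under assumptions that the description leaves open and is not the
authors' implementation: the intervals are taken to be of equal width, the
unspecified logarithm in step 2(a) is taken to be the natural logarithm, $m$
and $d$ are truncated to integers, each quota is rounded to the nearest
integer (so the total may differ slightly from $m$), and boundary points are
midpoints between adjacent distinct values. All function names are
hypothetical.

```python
from collections import Counter, defaultdict
from math import log, log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def boundary_points(values, labels):
    """Midpoints between adjacent distinct values whose examples differ in class."""
    classes_at = defaultdict(set)
    for v, c in zip(values, labels):
        classes_at[v].add(c)
    distinct = sorted(classes_at)
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])
            if len(classes_at[lo] | classes_at[hi]) > 1]


def partition_entropy(values, labels, cut):
    """Entropy of the binary partition at `cut`, computed over all examples."""
    left = [c for v, c in zip(values, labels) if v <= cut]
    right = [c for v, c in zip(values, labels) if v > cut]
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)


def eda_db_cuts(values, labels):
    k, l = len(set(labels)), len(set(values))
    points = boundary_points(values, labels)
    b = len(points)
    if b == 0:
        return []
    m = max(2, int(k * log2(l)))            # step 1 (truncation assumed)
    d = max(1, int(log(l)))                 # step 2(a) (natural log assumed)
    lo, hi = min(values), max(values)
    width = (hi - lo) / d                   # equal-width intervals (assumed)
    intervals = [[] for _ in range(d)]
    for t in points:                        # step 2(b): boundary points per interval
        intervals[min(int((t - lo) / width), d - 1)].append(t)
    cuts = []
    for interval in intervals:
        quota = round(len(interval) / b * m)        # steps 2(c) and 3 (rounding assumed)
        ranked = sorted(interval,                   # step 4: global entropy ranking
                        key=lambda t: partition_entropy(values, labels, t))
        cuts.extend(ranked[:quota])                 # step 5
    return sorted(cuts)


values = [1, 2, 3, 4, 5, 6, 7, 8]
spread_cuts = eda_db_cuts(values, [0, 1, 0, 1, 0, 1, 0, 1])
single_cut = eda_db_cuts(values, [0, 0, 0, 0, 1, 1, 1, 1])
```

Because the quota of each interval is proportional to its share of boundary
points, the selected cut points are spread over the value range instead of
clustering around the single lowest-entropy point.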

6 Experiments with EDA-DB 

We conducted experiments with EDA-DB coupled with ELEM2. We first con- 
ducted ten-fold evaluation on the segment data set to see whether EDA-DB im- 
proves over Max-m on this data set which has a large number of boundary points 
for several attributes. The result is that the predictive accuracy is increased to 
95.11% and the average number of rules drops to 69. Figure 1 shows the ten-fold 
evaluation results on 14 UCI data sets. In the figure, the solid line represents the 
difference between EDA-DB’s predictive accuracy and Max-m’s, and the dashed 
line represents the accuracy difference between EDA-DB and MDLP. The re- 
sults indicate that EDA-DB outperforms both Max-m and MDLP on most of 
the tested data sets. 

Fig. 1. Ten-fold Evaluation Results on Actual Data Sets.

7 Conclusions

We have presented an empirical comparison of three entropy-based discretization
methods in the context of learning decision rules. We found that the MDLP method
stops too early when the number of training examples is small and thus fails
to detect sufficient cut-points on small data sets. Our empirical results also
indicate that Max-m and EDA-DB are better discretization methods for ELEM2
on most of the tested data sets. We conjecture that the recursive nature of the
MDLP method may cause most of the cut-points to be selected based on small
samples of the instance space, which leads to the generation of unreliable cut-points.
The experiment with Max-m on the segment data set reveals that the strategy
of simply selecting the first m entropy-lowest cut-points does not work well
on large data sets with a large number of boundary points. The reason is
that entropy-lowest cut-points tend to concentrate in a small region of the
instance space, so the algorithm fails to pick up useful cut-points
in other regions. Our proposed EDA-DB method alleviates this problem of Max-m
by considering the distribution of boundary points over the instance space. Our
test of EDA-DB on the segment data set shows that EDA-DB improves over
Max-m in both predictive accuracy and the number of rules generated. The
experiments with EDA-DB on the other tested data sets also indicate that EDA-DB
is a better method than both Max-m and MDLP.

Abstract. Data mining is the discovery of previously unknown or hidden
and potentially useful knowledge in databases. In this paper, we
present an algorithm, called BRRA, that mines relationships in a database
in order to derive a compact rule set. This algorithm is based on a
mathematical concept called relevant rectangles, which represent the full
association between a set of i arguments and a set of j images in a binary
relation.

Keywords: Knowledge discovery algorithm, relevant rectangle, infer- 
ence rules 



1 Introduction 

Over the past two decades there has been a huge increase in the amount of data
being stored in databases. Knowledge discovery (KD) aims to delve into this
largely ignored data by bringing to the surface previously unknown,
potentially useful, and hidden knowledge in a database.

In this paper, we propose an algorithm, called BRRA (Based Relevant Rectangles
Algorithm), that mines relationships in a database in order to derive rules. The
derived rules describe the dependencies between a target attribute and
a conjunction of condition attributes. To be used efficiently, the
derived rule set must be compact (i.e., minimal number of rules and minimal
number of condition attributes in each rule).

This paper is organized as follows. In Section 2, we present the mathematical
background of the rectangle concept. Section 3 is devoted to the presentation of
the KD algorithm for deriving a compact rule set. Section 4 presents the
evaluation model assessing the compactness of the derived rule set and discusses
the computational complexity of the KD algorithm. Finally, Section 5 concludes
the paper and points out future issues.

2 Basic Definitions 

In our proposal, we use the concept of relevant rectangle (RR). An RR reflects
interesting semantic and structural properties. Roughly speaking, a "rectangle"
of a binary relation R is a pair of sets (A, B) such that $A \times B \subseteq R$.
Searching for RRs of a finite binary relation is a problem that was first
studied by pure mathematicians within the framework of lattice theory
and was later shown to be relevant in several practical fields of computer
science [2].

Definition 1. A binary relation R on a set E in a set F is a subset of the
Cartesian product $E \times F$. We define the domain of R, denoted dom(R), as
$dom(R) = \{x \mid \exists y : (x, y) \in R\}$, and the range of R, denoted ran(R), as
$ran(R) = \{y \mid \exists x : (x, y) \in R\}$.

Definition 2. Let R be a binary relation defined on E in F. A rectangle of R is
a pair of sets (A, B) such that $A \subseteq E$, $B \subseteq F$ and $A \times B \subseteq R$, where A is
the domain of the rectangle and B is its range.

Definition 3. Let R be a binary relation defined on E in F. A rectangle (A, B)
of R is said to be maximal if and only if: $A \times B \subseteq A' \times B' \subseteq R \Rightarrow A = A'$ and $B = B'$.

Example 1. Let us consider the binary relation R defined on E in F where E = {1,
2, 3, 4} and F = {A, B, C}. The rectangle $RE_1$ from R given in figure 1 is
maximal, whereas the rectangle $RE_2$ is not maximal since $RE_2 \subset RE_1$.




Fig. 1. Example of maximal rectangle. 



In the following, we present a heuristic to choose the RR among different
maximal ones. Let us consider $(RE_1, RE_2, \ldots, RE_p)$, p maximal rectangles of a
binary relation R. The rectangle RE is said to be relevant if and only if it satisfies
the following condition:

$|dom(RE)| = \max_{i=1,\ldots,p} |dom(RE_i)|$. If the domains of two or more
rectangles have equal cardinality, then the rectangle RE is elected as relevant if
$|ran(RE)| = \min_{i=1,\ldots,p} |ran(RE_i)|$.

Example 2. In figure 2, given that $|dom(RE_1)| = |dom(RE_2)|$, the rectangle
$RE_1$ is elected as relevant, since $|ran(RE_1)| < |ran(RE_2)|$.

Fig. 2. Example of relevant rectangle.
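
Definitions 2 and 3 and the selection heuristic can be illustrated with a small Python sketch. It is not the BRRA generation procedure itself: it only tests whether a given pair of non-empty sets is a (maximal) rectangle and picks the relevant one from a list of maximal rectangles that is assumed to be already available. The relation is represented as a set of pairs, and all function names are hypothetical.

```python
from itertools import product

def is_rectangle(relation, a, b):
    """(a, b) is a rectangle of `relation` when a x b is a subset of it."""
    return all((x, y) in relation for x, y in product(a, b))

def is_maximal(relation, a, b):
    """A rectangle with non-empty sides is maximal when no single element can
    be added to either side without leaving the relation."""
    if not is_rectangle(relation, a, b):
        return False
    domain = {x for x, _ in relation}
    rng = {y for _, y in relation}
    can_grow_a = any(is_rectangle(relation, {x}, b) for x in domain - a)
    can_grow_b = any(is_rectangle(relation, a, {y}) for y in rng - b)
    return not (can_grow_a or can_grow_b)

def relevant_rectangle(maximal_rectangles):
    """Largest domain first; ties are broken by the smallest range."""
    return min(maximal_rectangles, key=lambda r: (-len(r[0]), len(r[1])))

# Hypothetical toy relation, not the one drawn in the figures.
R = {(1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "a")}
candidates = [({1, 2}, {"a", "b"}), ({1, 2, 3}, {"a"})]
print(relevant_rectangle(candidates))
```

Checking single-element extensions is sufficient here because any strictly larger rectangle containing (A, B) must contain at least one extra element on one side.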

3 KD Algorithm 

To characterize data easily by means of rules and to avoid deriving complex
rules, one must reduce the number of distinct values of each
attribute by performing a conceptual abstraction (or generalization) on the
database. This abstraction step lifts specific values to higher-level concepts.
For example, suppose that the domain of the attribute "Age" is the range [0, 60].
Then the distinct age values ranging from 0 to 5 and from 6 to 12 can be
replaced by the user-defined concepts "infant" and "youngster", respectively.

Since the KD algorithm is based on the search for RRs, the n-ary generalized
relation is then transformed into a binary relation called BR. BR is defined
between a set of tuple numbers (N) and a set of properties (P), as a subset of the
Cartesian product $N \times P$, by: BR(N, P) = 1 if the tuple has the property P;
BR(N, P) = 0 otherwise.
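
The transformation into BR can be sketched as follows. The sketch assumes that a "property" is an (attribute, value) pair and stores BR as the set of (tuple number, property) pairs for which BR = 1; both choices, and the function names, are illustrative assumptions rather than the authors' data structure.

```python
def to_binary_relation(rows, attributes):
    """rows maps a tuple number to a dict {attribute: generalized value}.
    The result holds exactly the pairs (n, property) with BR(n, property) = 1."""
    return {(n, (attr, row[attr])) for n, row in rows.items() for attr in attributes}

def br(relation, n, prop):
    """BR(n, prop) as defined in the text: 1 if tuple n has the property, else 0."""
    return 1 if (n, prop) in relation else 0

# Two generalized tuples used only as a toy input.
rows = {
    1: {"Displace": "large", "Cyl": 6},
    4: {"Displace": "small", "Cyl": 6},
}
BR = to_binary_relation(rows, ["Displace", "Cyl"])
print(br(BR, 1, ("Displace", "large")), br(BR, 4, ("Displace", "large")))
```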

The binary relation BR is the input to the KD algorithm given below.

KD algorithm

Input: A binary relation BR and a target attribute.

Output: A set of compact If-Then rules.

Begin

1. For all distinct values of the target attribute (TA) Do

2. Decomposition step: the original base BR is decomposed into two sub-bases
$B_1$ and $B_0$. $B_1$ contains the tuples such that $t_i.TA = 1$, and $B_0$ contains the
tuples such that $t_i.TA = 0$, $i = 1, \ldots, n$, where n is the number of tuples.

3. Generation step: generate the RRs covering the tuples of the sub-base
$B_1$. Theoretically, each generated RR corresponds to a rule of the form "If X
is A then Y is B", where X is a set of attributes $X_i$, $i = 1, \ldots, k$, and A
is the set of their associated values $v_i$, $i = 1, \ldots, k$. Y is the target attribute
and B is its associated value. The semantics of such a rule is that when "X
is A" is satisfied, we can infer that "Y is B" is also satisfied. At this step, all
the generated rules are regarded as possible rules until the verification step is
performed.

4. Verification step: this step checks whether there exist counter-examples
to the generated rules. We consider the condition attributes
of each rule $(A_1, A_2, \ldots, A_k)$ and check in $B_0$ whether there exists a tuple $t_j$ such that
($t_j.A_1 = 1$ and $t_j.A_2 = 1$ and ... and $t_j.A_k = 1$); if so, the rule is discarded.

End.

An RR represents the full association between a set of i arguments and a set
of j images in a binary relation. Hence, we assume that the properties in the range
of an RR constitute the condition part of the rule, and that the conclusion part is
given by the associated TA value. For example, the RR {1, 2, 5, 6} × {DL, C6}
can be rewritten as the following If-Then rule: "If Displace = Large and Cyl = 6
then Cost = Expensive", where Cost is the target attribute. The domain of the
RR indicates the set of tuples covered by this rule.

Example 3. The following table presents the generalized base GBC storing
car characteristics, as yielded by the abstraction step [5].

N   Displace  Fuelcap  Mass    Speed   Cyl  Cost
1   large     high     medium  medium  6    expensive
2   large     low      heavy   fast    6    expensive
3   medium    medium   light   fast    6    medium
4   small     low      light   slow    6    cheap
5   large     medium   medium  medium  6    expensive
6   large     medium   light   medium  6    expensive
7   small     low      light   medium  6    cheap
8   small     medium   light   slow    4    cheap
9   medium    low      medium  medium  6    medium
10  medium    high     medium  fast    6    expensive
11  medium    high     light   fast    6    expensive
12  small     high     heavy   medium  4    expensive
13  small     high     light   medium  4    cheap
14  medium    low      heavy   medium  4    expensive
15  medium    medium   medium  medium  4    medium
16  medium    high     medium  medium  4    medium
17  small     medium   medium  fast    4    medium
18  medium    medium   heavy   slow    4    expensive


Choosing the attribute "Cost" as the target attribute and applying the BRRA
algorithm to the binary relation derived from the relation GBC, the following
compact If-Then rule set is derived (with the set of tuples covered by each
rule):

R1: If Displace = Large and Cyl = 6 then Cost = Expensive {1, 2, 5, 6},
R2: If Mass = Heavy then Cost = Expensive {2, 12, 14, 18},
R3: If Fuelcap = High and Cyl = 6 then Cost = Expensive {1, 10, 11},
R4: If Mass = Medium and Cyl = 4 then Cost = Medium {15, 16, 17},
R5: If Displace = Medium and Cyl = 6 then Cost = Medium {3, 9},
R6: If Displace = Small and Mass = Light then Cost = Cheap {4, 7, 8, 13}.
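
The decomposition and verification steps can be illustrated on the GBC table with a short Python sketch. It does not generate rectangles; it takes a rule's condition part as given, collects the tuples of the target class that the rule covers, and looks for counter-examples among the remaining tuples. The helper name `check_rule` and the dictionary layout are assumptions made for the example.

```python
COLUMNS = ("Displace", "Fuelcap", "Mass", "Speed", "Cyl", "Cost")
GBC = {
    1: ("large", "high", "medium", "medium", 6, "expensive"),
    2: ("large", "low", "heavy", "fast", 6, "expensive"),
    3: ("medium", "medium", "light", "fast", 6, "medium"),
    4: ("small", "low", "light", "slow", 6, "cheap"),
    5: ("large", "medium", "medium", "medium", 6, "expensive"),
    6: ("large", "medium", "light", "medium", 6, "expensive"),
    7: ("small", "low", "light", "medium", 6, "cheap"),
    8: ("small", "medium", "light", "slow", 4, "cheap"),
    9: ("medium", "low", "medium", "medium", 6, "medium"),
    10: ("medium", "high", "medium", "fast", 6, "expensive"),
    11: ("medium", "high", "light", "fast", 6, "expensive"),
    12: ("small", "high", "heavy", "medium", 4, "expensive"),
    13: ("small", "high", "light", "medium", 4, "cheap"),
    14: ("medium", "low", "heavy", "medium", 4, "expensive"),
    15: ("medium", "medium", "medium", "medium", 4, "medium"),
    16: ("medium", "high", "medium", "medium", 4, "medium"),
    17: ("small", "medium", "medium", "fast", 4, "medium"),
    18: ("medium", "medium", "heavy", "slow", 4, "expensive"),
}

def check_rule(conditions, target_value, target="Cost"):
    """Return (covered, counter_examples) for 'If conditions then target = target_value'.
    covered: matching tuples in the sub-base where the target value holds (B1);
    counter_examples: matching tuples in the other sub-base (B0)."""
    rows = {n: dict(zip(COLUMNS, values)) for n, values in GBC.items()}
    matching = {n for n, row in rows.items()
                if all(row[attr] == value for attr, value in conditions.items())}
    covered = {n for n in matching if rows[n][target] == target_value}
    return covered, matching - covered

print(check_rule({"Displace": "large", "Cyl": 6}, "expensive"))  # rule R1
```

A rule survives the verification step when the second returned set is empty.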



4 Evaluation Model 



Yen et al. [5] proposed an evaluation model to assess the compactness
of a rule set. The measure, denoted E, is larger the more compact the rule
set is, and is defined as

$$E = \frac{1}{r} \sum_{i=1}^{r} \left( \frac{t_i}{n_i} \times \frac{1}{c_i} \right),$$

where r is the cardinality of the rule set; $t_i$ is the number of tuples covered
by rule i in the rule set; $n_i$ is the number of tuples in which the target
attribute value involved in the consequent of rule i appears; and $c_i$ is the
number of condition attributes in the antecedent of rule i, $1 \le i \le r$.
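
As a worked check, the measure can be evaluated for the six BRRA rules of Example 3. The sketch below assumes the averaged form $E = \frac{1}{r}\sum_i \frac{t_i}{n_i}\cdot\frac{1}{c_i}$; the class sizes (9 expensive, 5 medium, 4 cheap tuples) are counted from the GBC table, and the tuple layout is only a convenience for the example.

```python
def compactness(rules):
    """rules: list of (t_i, n_i, c_i) = (tuples covered, tuples carrying the
    consequent's target value, number of condition attributes)."""
    return sum(t / n / c for t, n, c in rules) / len(rules)

brra_rules = [
    (4, 9, 2),  # R1: Displace = Large and Cyl = 6 -> Expensive
    (4, 9, 1),  # R2: Mass = Heavy -> Expensive
    (3, 9, 2),  # R3: Fuelcap = High and Cyl = 6 -> Expensive
    (3, 5, 2),  # R4: Mass = Medium and Cyl = 4 -> Medium
    (2, 5, 2),  # R5: Displace = Medium and Cyl = 6 -> Medium
    (4, 4, 2),  # R6: Displace = Small and Mass = Light -> Cheap
]
print(round(compactness(brra_rules), 2))
```

Under these assumptions the value is 11/36, which rounds to the 0.31 reported for BRRA.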

BRRA is compared with other approaches in the following table.

Approach        Rule set cardinality   E
BRRA            6                      0.31
LCR [5]         6                      0.33
ID3 [3]         9                      0.172
Cai et al. [4]  18                     0.033



In terms of data access, the complexity of BRRA is linear: it needs to
scan the base only once, which is not the case for other approaches [1, 6], where
the relation may need to be scanned repeatedly. It is easy to show that the
computational complexity is $O(m \times |R|)$, where m is the number of attributes and |R| is the
relation cardinality, given that the complexity of the second step is equal to
min(n,m)^ x max(n,m).



5 Conclusion 



In this paper, we have proposed a new algorithm, based on the discovery of
RRs, to mine relationships or dependencies between attributes in a database. The
presented algorithm is being implemented, and the parallelization
of the generation step (the most costly one) is under study. Another issue to
which the present work will be extended is the handling of uncertain data.

Abstract. This paper defines a kind of rule, the functional dependency rule. The
degree of functional dependency in a relational database can be described by this
kind of rule. We give an algorithm to mine such rules and prove several
theorems that ensure the efficiency and correctness of the algorithm.
Finally, we present some experimental results to support our conclusions.



1 Introduction 

Most real-world data, such as bank data and marketing data, are now stored in
relational databases. It is therefore important to study mining techniques
oriented towards relational databases.

In relational database theory, the functional dependency relationship (called FD in
the following) among a relation's attributes is a very important and basic concept. We
usually establish the functional dependencies when designing the database
tables. Frequently, however, we are given only the data of a table and do not know the
relationships among its attributes, and sometimes there are FDs that we are not aware of. So
we need sophisticated methods to find the FDs from the data of a table.

The remainder of this paper is organized as follows. In Section 2, we first give the
definition of the functional dependency rule (called FDR in the following) of
relations. Then, in Section 3, we give some extended definitions and prove some
theorems in order to design and analyze the mining algorithm. In the last section, the
conclusions and future work are presented.



2 The Definition of FDR 

There is noise in practical data, so searching for absolute FDs is meaningless. We need
methods to evaluate the FD degree among the attributes of a table; this leads to the FDR
defined below.

Definition 1 Data distribution is a function whose parameters are a table and a set of the
table's attributes, and whose value is a set of 2-tuples. In each 2-tuple, the first
component is a value of the attribute set and the second is the number of times this value
appears in the table. We denote the data distribution as
$dd(T, A) = \{(v_1, n_1), (v_2, n_2), \ldots, (v_p, n_p)\}$.

Definition 2 Value distribution is a function with the same parameters as the
function of Definition 1, whose value is the set of distinct values of the given
attributes. We denote this function as $ddval(T, A) = \{v_1, v_2, \ldots, v_p\}$.

Definition 3 Number distribution is a function with the same parameters as the
function of Definition 1, whose value is the set of natural numbers recording how many
times each value of the attributes appears in the table. We denote this function as
$ddnum(T, A) = \{n_1, n_2, \ldots, n_p\}$.

Definition 4 This function computes the number of occurrences of one particular value
of the attributes. We denote it as $getnum(T, A = v_i, B) = n$.

The value of ddval consists of the first components of the value of
dd. The value of ddnum consists of the second components of the value of
dd. The value of getnum is the number of tuples whose value equals the given
value.

We extend the function dd with a condition: $cdd(T, C, B) = dd(T', B)$, where T' is
composed of the tuples that satisfy the condition C. The other two functions are
extended in the same way.

Suppose A and B are attributes of the table and the total number of tuples is n. The
data distribution of attribute A is $\{(a_1, n_1), (a_2, n_2), \ldots, (a_p, n_p)\}$, which means that $a_i$
occurs $n_i$ times in attribute A. For every value of attribute A ($a_1, a_2, \ldots, a_p$), the
maximum count in the corresponding data distribution of attribute B is $m_1, m_2, \ldots, m_p$.
Obviously, $\sum_{i=1}^{p} n_i = n$ holds.

In the following, "fdd" stands for the functional dependency degree.

Definition 5

$$fdd(A, B) = \sum_{i=1}^{p} w_i \frac{m_i}{n_i} = \frac{\sum_{i=1}^{p} m_i}{n} = \frac{\sum_{i=1}^{p} \max(cddnum(T, A = a_i, B))}{n}$$

where $w_i$ is the weight of the i-th value of attribute A; usually, the more often the
value occurs, the greater $w_i$ is (the equalities above hold for $w_i = n_i / n$).

Definition 6 Given a confidence threshold minc, if fdd(A, B) >= minc, then we say that
there exists a functional dependency rule (FDR), denoted B = f(A). (To distinguish it
from the concept of FD in relational database theory, we do not use the notation
A->B.) A is called the determining term and B the depending term.

Remark 1: According to Definition 6, we can easily conclude that fdd((A,
B), A) = 1 and fdd(A, (A, B)) = fdd(A, B), so a rule is of no interest when the intersection
of its determining term and depending term is not empty. In this paper, we only
discuss rules whose determining term and depending term are disjoint.

Remark 2: The symbols A and B denote attribute sets, such as $\{A_1, A_2, \ldots, A_k\}$. In this
paper, we omit the set braces "{" and "}" for convenience. For
example, we write (B, C) = f(A) instead of ({B, C}) = f({A}), and A = f(B, C) instead of
{A} = f({B, C}).
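
The functions dd, cdd and fdd can be sketched in Python as follows. The sketch assumes that a table is a list of dictionaries, that attribute sets are given as lists of column names, and that a condition is a dictionary of attribute-value equalities; fdd is computed in the form $\sum_i m_i / n$ from Definition 5.

```python
from collections import Counter

def dd(table, attrs):
    """Data distribution: each value of the attribute set with its count."""
    return Counter(tuple(row[a] for a in attrs) for row in table)

def cdd(table, condition, attrs):
    """Conditional data distribution: dd over the tuples satisfying `condition`."""
    selected = [row for row in table
                if all(row[a] == v for a, v in condition.items())]
    return dd(selected, attrs)

def fdd(table, a_attrs, b_attrs):
    """Functional dependency degree: sum over the values a_i of the largest
    count m_i in the conditional distribution of B, divided by n."""
    total = 0
    for a_value in dd(table, a_attrs):
        condition = dict(zip(a_attrs, a_value))
        total += max(cdd(table, condition, b_attrs).values())
    return total / len(table)

# Hypothetical four-tuple table.
T = [
    {"A": 1, "B": "x", "C": 0},
    {"A": 1, "B": "x", "C": 1},
    {"A": 1, "B": "y", "C": 0},
    {"A": 2, "B": "z", "C": 1},
]
print(fdd(T, ["A"], ["B"]))
```

On this toy table, fdd(A, B) = (2 + 1) / 4 = 0.75, and fdd((A, B), A) = 1, in line with Remark 1.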

3 The Algorithm of Mining Functional Dependency Rules

3.1 The Related Theorems

Theorem 1 If the FDR B = f(A) is established, then the FDR B = f(A, C) is established.
Proof: We only need to prove that fdd({A, C}, B) >= fdd(A, B).

Suppose $dd(T, A) = \{(a_1, an_1), (a_2, an_2), \ldots, (a_p, an_p)\}$ and
$dd(T, \{A, C\}) = \{(ac_1, acn_1), (ac_2, acn_2), \ldots, (ac_q, acn_q)\}$. Obviously, the data
distribution of {A, C} is a partition of the data distribution of A. Suppose
$(ac_1, acn_1), (ac_2, acn_2), \ldots, (ac_k, acn_k)$ is the partition of $(a_1, an_1)$; then

$$\sum_i cdd(T, A = a_i, B) = \sum_i \sum_j cdd(T, (A = a_i, C = c_j), B)$$

$$\max(cdd(T, A = a_i, B)) \le \sum_j \max(cdd(T, (A = a_i, C = c_j), B))$$

$$fdd(\{A, C\}, B) = \frac{\sum_i \sum_j \max(cdd(T, (A = a_i, C = c_j), B))}{n} \ge \frac{\sum_i \max(cdd(T, A = a_i, B))}{n} = fdd(A, B) \ge minc$$

So the rule B = f(A, C) is established.

Corollary 1 If B = f(A), then B = f(A, C, ...).

Theorem 2 If B = f(A) is not established, then {B, C} = f(A) is not established.
Proof: We only need to prove that fdd(A, {B, C}) <= fdd(A, B).

Because the data distribution of (B, C) for every value of attribute A is a
partition of the data distribution of B, the following holds:

$$\max(cdd(T, A = a_i, (B, C))) \le \max(cdd(T, A = a_i, B))$$

$$fdd(A, \{B, C\}) = \frac{\sum_i \max(cdd(T, A = a_i, (B, C)))}{n} \le \frac{\sum_i \max(cdd(T, A = a_i, B))}{n} = fdd(A, B) < minc$$

So the rule (B, C) = f(A) is not established.

Corollary 2 If not B = f(A), then not (B, C, ...) = f(A).

Corollary 3 If (B, C) = f(A), then B = f(A) and C = f(A).

Definition 7 An FDR is called a basic FDR if the rule is no longer established when
any one attribute is deleted from its determining term or any one attribute is added
to its depending term.

Theorem 3 All the FDRs can be deduced from the basic FDRs by Theorem 1 and
Theorem 2.

Proof: We prove this theorem by mathematical induction.

1) The induction step: the rule is not a basic FDR.

   Suppose the rule's determining term has n attributes and its depending
   term has m attributes; then it can be written as C = f(A, B), where A is an attribute
   and B and C are attribute sets with n-1 and m elements, respectively.

   Suppose the rules whose determining term has n-1 attributes and whose depending
   term has m+1 attributes can be deduced from basic FDRs.

   By Corollary 3, (C, D) = f(B) => C = f(B), and by Theorem 1, C = f(B) => C = f(A, B),
   where D is an attribute.

   So the conclusion is established.

2) The induction base: if the rule is a basic FDR, then the conclusion clearly
   holds.



3.2 The Idea of Algorithm 

We use a DAG to organize all of the rules. Each rule is one node in the graph.
If there are two rules, r1 and r2, and r2 is a directly reducing rule of r1, then there is a
directed edge from r1 to r2 in the graph, and the node of r1 is called the father of the
node of r2. The task of finding all the basic FDRs is to find all the nodes that are
established but whose fathers are not established.

The root of the DAG is the starting point of the search. If the father is established, this
search branch is pruned; otherwise all the sons are searched. Obviously, all the rules
found by this method are basic FDRs, and all the FDRs can be deduced from the basic
FDRs by Theorem 3.



3.3 Mining All the FDRs in a Relational Database Table

Algorithm: Mining all the basic FDRs in a relational database table.

Input: Table $T = \{A_1, A_2, \ldots, A_m\}$, confidence threshold minc.

Output: Table fdrTab = (ruleID, DeterTerm, DepenTerm)

Data structure: Queue qWaitJudgeRules; sets sFDR, sNonFDR

1) Initialize variables: fdrTab = empty; add the rule T = f(NULL) to sNonFDR

2) Loop until sNonFDR is empty

   2.1) build qWaitJudgeRules from the directly reducing rules of sNonFDR, excluding
        the directly reducing rules of sFDR

   2.2) insert all the rules of sFDR into fdrTab

   2.3) loop until qWaitJudgeRules is empty

        2.3.1) take the head rule of qWaitJudgeRules, called r

        2.3.2) if the rule r is established,
               then insert rule r into sFDR
               else insert rule r into sNonFDR

3) Output the result
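
For small tables, the set of basic FDRs can be cross-checked by applying Definition 7 directly. The sketch below is a brute-force enumeration, not the DAG search described above: it tests every pair of disjoint determining and depending terms and keeps the established rules that break when an attribute is removed from the determining term or added to the depending term. Allowing an empty determining term (as in the root rule T = f(NULL)), the table layout, and all function names are assumptions of this example.

```python
from collections import Counter, defaultdict
from itertools import combinations

def fdd(table, x, y):
    """Functional dependency degree of y on x: sum of the largest y-count
    within each group of equal x-values, divided by the number of tuples."""
    groups = defaultdict(Counter)
    for row in table:
        groups[tuple(row[a] for a in x)][tuple(row[b] for b in y)] += 1
    return sum(max(counts.values()) for counts in groups.values()) / len(table)

def subsets(items, min_size=0):
    for size in range(min_size, len(items) + 1):
        yield from combinations(items, size)

def basic_fdrs(table, attributes, minc):
    """Return (determining term, depending term) pairs that satisfy Definition 7."""
    def holds(x, y):
        return fdd(table, x, y) >= minc

    result = []
    for x in subsets(attributes):
        rest = [a for a in attributes if a not in x]
        for y in subsets(rest, min_size=1):
            if not holds(x, y):
                continue
            # Deleting any attribute from the determining term must break the rule.
            shrink_ok = all(not holds(tuple(a for a in x if a != drop), y) for drop in x)
            # Adding any remaining attribute to the depending term must break it too.
            free = [a for a in rest if a not in y]
            grow_ok = all(not holds(x, y + (extra,)) for extra in free)
            if shrink_ok and grow_ok:
                result.append((x, y))
    return result

# Hypothetical table in which A and B determine each other and C is unrelated.
T = [
    {"A": 1, "B": "x", "C": 0},
    {"A": 1, "B": "x", "C": 1},
    {"A": 2, "B": "y", "C": 0},
    {"A": 2, "B": "y", "C": 1},
]
print(basic_fdrs(T, ("A", "B", "C"), minc=0.9))
```

The enumeration is exponential in the number of attributes, so it is useful only as a reference against which a pruned search can be compared.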

3.4 The Analysis of Time Complexity

There are two nested loops. The maximum length of the outer loop is m-1, where m is
the number of the table's attributes; the maximum length of the inner loop is a function
of the iteration count of the outer loop. If the outer loop is in its i-th iteration
(1 <= i <= m-1), then the maximum length of the inner loop is $C_m^i (2^{m-i} - 1)$, so the
time complexity of the algorithm is

$$\sum_{i=1}^{m-1} C_m^i (2^{m-i} - 1)\, n = O((1.5)^m n).$$

Consider instead the brute-force method of pure enumeration. All the terms are
elements of the power set of the attributes of the table. If the table has m attributes,
then the cardinality of the power set is $2^m$. If we judge the rule between every pair of
terms one by one, in both directions, the time complexity is

$$O(2^m (2^m - 1)\, n) = O((2^{2m} - 2^m)\, n) = O(2^{2m} n).$$

So our algorithm is much faster than the naive algorithm.

There is an algorithm symmetrical to the one above. We can define the
negative counterpart of the reducing relation between two rules, called the denying
relation, and organize the DAG by the denying relation. The symmetrical algorithm
searches the DAG from the root to find all the nodes that cannot be denied. The result is
the deepest nodes that are established. If the confidence threshold minc is very large, the
symmetrical algorithm is faster than the above algorithm.



4 Discussion and Conclusion 

This paper puts forward the definition of the functional dependency rule of a relational
table and gives an efficient algorithm for mining such rules. Our experimental results
support our argument.

Finally, it should be pointed out that our FDR is closely related to the well-known
association rule. An FDR can be regarded as a summary of a group of
association rules. Further, association rules can be used to define and mine FDRs.

Abstract. Growing attention has been paid to mining time-series
knowledge, and time-series prediction has become one of the important
aspects of data mining and knowledge discovery (DMKD). This paper
presents a new mechanism of time-series prediction with cloud models.
This mechanism not only synthesizes different predictive knowledge
with different granularities, but also combines two kinds of predictive
strategy: local prediction and overall prediction. We focus on
the application of cloud models to transform between quantitative and
qualitative knowledge, to synthesize different kinds of knowledge, and
to realize soft inference.



1 Introduction 

Data mining and knowledge discovery (DMKD) has become an active and growing
research area. Recently, mining qualitative predictive knowledge for time-series
prediction has been listed as one of the challenges for DMKD. Time-series prediction
is the forecasting of future values from a temporal sequence of data. Such prediction is
required in many fields, such as the prediction of weather, network load, future sales,
the stock market and so on.

There are many kinds of time-series predictive approaches. The most common
"classical" approaches include Autoregressive (AR), Moving Average (MA) and
combined approaches: ARMA, ARIMA, and ARCH. Neural network models are an
alternative to the classical methods [4, 7]. Recently, some interesting work has also
been done on applying machine learning to temporal domains [6, 8]. Although these
approaches all have their own advantages, they share common shortcomings.
Firstly, these methods usually give one number or a sequence of numbers as the predictive
result. Quantitative results are difficult for users to understand and are usually
crisp, while qualitative knowledge may be more robust and easier to understand, but is
difficult to express and calculate in computers. So the representation of predictive
knowledge is a challenging and inescapable subject. Secondly, in most of these
methods, little attention has been paid to the time granularity, although it is a very
important element when analyzing time-series data.

In the rest of the paper, a new mechanism of time-series prediction based on cloud
models is proposed. The representation of predictive knowledge with cloud models
is presented in Section 2. Section 3 then expounds the new mechanism of time-series
prediction, and Section 4 gives the experimental results and the conclusion.

2 Representation of Predictive Knowledge with Cloud Models

Suppose we have a set of time-series data D: {($a_i$, $b_i$) | 0<i<t}, where $a_i$ is a value
of the time attribute A and $b_i$ is a value of the numerical attribute B. The task of prediction is
to forecast the value $b_t$ at the future time $a_t$. To complete the predictive task, the first
problem we meet is the representation of predictive knowledge. People usually
represent their predictive knowledge as linguistic rules such as "if it is the mid-season
then the sales may be high", which include the uncertain concepts "mid-season" and
"high", as well as the uncertain inference "if ... then ... may ...". So predictive knowledge
is a kind of qualitative knowledge, which is always indeterminate and uncertain. In
this section, a randomized method, the cloud model, is introduced to represent
predictive knowledge.

Let U = {u} be the universe of discourse, and T a term associated
with U. The membership degree of u in U to the term T, $C_T(u) \in [0, 1]$, is a random
number with a stable tendency. The cloud of T is a mapping from the universe of
discourse U to the unit interval [0, 1]. That is:

$$C_T(u): U \to [0, 1], \quad \forall u \in U, \; u \mapsto C_T(u) \qquad (1)$$

The normal cloud (NC) is based on normal distributions. It can be characterized by
three numerical parameters: A(Ex, En, He). The expected value Ex indicates the center
of gravity of a cloud. The entropy En is a measure of the fuzziness of the concept over
the universe of discourse, showing how many elements could be accepted by the
term A. The hyper-entropy He is a measure of the uncertainty of the entropy En. The
larger the value of He, the more randomly the membership degrees are distributed.
Figure 1 shows the appropriate NC for the term "young". Instead of a membership
curve, the mapping from $\forall u \in U$ to the interval [0, 1] is a one-point to multi-point
transition, which shows the uncertainty (both fuzziness and randomness) of an element
belonging to the term. So the degree of membership of u to T is a probability
distribution rather than a fixed value.




Fig. 1. The Normal Cloud of the Term “young”(20, 0.7, 0.025) 
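
The one-point to multi-point mapping of a normal cloud can be sketched as follows. The text names the generators but does not spell out their formulas, so the sketch assumes the usual normal-cloud construction from the cloud-model literature: for a given u, draw En' from a normal distribution with mean En and standard deviation He, then take exp(-(u - Ex)^2 / (2 En'^2)) as one membership sample. The function name and the seeding are illustrative.

```python
import math
import random

def membership_samples(u, ex, en, he, count, seed=0):
    """Draw `count` membership degrees of u for the normal cloud (ex, en, he).
    Assumes en > 0 and he >= 0."""
    if en <= 0:
        raise ValueError("en must be positive")
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        en_prime = rng.gauss(en, he)
        while en_prime == 0.0:          # avoid a division by zero
            en_prime = rng.gauss(en, he)
        samples.append(math.exp(-((u - ex) ** 2) / (2 * en_prime ** 2)))
    return samples

# Parameters of the term "young" from Fig. 1: (Ex, En, He) = (20, 0.7, 0.025).
print(membership_samples(21.0, 20.0, 0.7, 0.025, count=5))
```

With He = 0 every sample collapses onto the ordinary Gaussian membership curve; with He > 0 the samples for the same u scatter around it, which is the randomness the text describes.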

Based on cloud models, a linguistic variable can be defined as a set of linguistic
terms, represented as $A\{A_1(Ex_1, En_1, He_1), A_2(Ex_2, En_2, He_2), \ldots, A_m(Ex_m, En_m, He_m)\}$.
To realize the transformation between quantitative values and qualitative concepts, four
kinds of cloud generators are introduced: the basic cloud generator, the u-condition cloud
generator, the μ-condition cloud generator, and the backward cloud generator [1, 3].

In daily life, most social behaviors are carried out according to
natural time: year, month, or day. Their variation is therefore periodic to
some degree. However, there are too many uncertain factors influencing those
behaviors, and we cannot capture every factor accurately. So we treat them as
quasi-periodic regularities and represent them as predictive linguistic rules. In realizing
the linguistic rules, it is difficult but important to maintain the uncertainty of
inference. Here we use linguistic rules based on cloud models to carry out the soft
inference "A → B", where A and B are linguistic concepts represented by cloud
models. Two cloud generators, the u-condition cloud generator and the μ-condition cloud
generator, are connected to construct the rule, as shown in figure 2. See [1, 2, 3,
9] for details.

Fig. 2. The Implementation of the Predictive Linguistic Rule A → B

Although the knowledge of quasi-periodic regularity captures the time series as a
whole, it is not precise and fresh enough for forecasting. Because time-series data are
stable over a short time, we can discover predictive knowledge of the current tendency
from the current data. It should be uncertain, obey the distribution of the current data,
and be represented as a qualitative tendency instead of a quantitative value. We
introduce a new concept based on the cloud model, the Current Cloud $I_t$(Ex, En, He), to
represent the predictive knowledge of the current tendency. Here Ex indicates the
center of the current tendency as the expected predictive value, En is a measure of the
fuzziness of the current tendency, showing how many values could be accepted by the
tendency, and He gives a measure of the uncertainty of the entropy En.



3 Mechanism of Time-Series Prediction with Cloud Models

Two kinds of predictive knowledge have now been described. The predictive
linguistic rules emphasize the overall regularity of the time-series data, while the
current cloud focuses on the temporary tendency of the current data. Their knowledge
granularities are also different. The predictive linguistic rules summarize the
time-series data over many historical periods, so their granularity is higher than that of
the current cloud, which only sums up the current data in the recent period. The following
algorithm realizes time-series prediction with cloud models by synthesizing these
two kinds of predictive knowledge.

Algorithm: Time-series Prediction with Cloud Models

1. Divide the time-series data set D into the set of historical data HD and the set of
   current data CD;

2. Mine the quasi-periodic regularity as a set of predictive linguistic rules
   PLR = {$A_i \to B_i$, i = 1, ..., m} from HD;

3. Mine the current tendency as the current cloud $I_t$ from CD;

4. Activate a linguistic rule $A_i \to B_i$ in PLR by judging which antecedent
   linguistic term the time value $a_t$ belongs to;

5. Synthesize the historical cloud $B_i$ and the current cloud $I_t$ into a new cloud $S_t$;

6. Realize the soft inference $A_i \to S_t$ to forecast.

We can use the following equations to synthesize the clouds $B_i(Ex_1, En_1, He_1)$ and
$I_t(Ex_2, En_2, He_2)$ into $S_t(Ex, En, He)$:

$$Ex = \frac{Ex_1 En_1 + Ex_2 En_2}{En_1 + En_2}, \qquad En = En_1 + En_2, \qquad He = \frac{He_1 En_1 + He_2 En_2}{En_1 + En_2} \qquad (2)$$

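
Equation (2) can be written as a small Python function. The tuple layout (Ex, En, He) and the function name are assumptions made for the example, and the numbers in the usage line are invented purely to show the arithmetic.

```python
def synthesize(cloud_1, cloud_2):
    """Combine two clouds (Ex, En, He) into one following equation (2):
    an entropy-weighted mean for Ex and He, and the sum of the entropies for En."""
    ex1, en1, he1 = cloud_1
    ex2, en2, he2 = cloud_2
    en = en1 + en2
    ex = (ex1 * en1 + ex2 * en2) / en
    he = (he1 * en1 + he2 * en2) / en
    return ex, en, he

# Hypothetical historical cloud B_i and current cloud I_t.
print(synthesize((10.0, 2.0, 0.1), (20.0, 2.0, 0.3)))
```

With equal entropies the synthesized Ex lies halfway between the two expected values; a cloud with a larger entropy pulls Ex and He towards its own values.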



Cloud synthesis is normally used to merge linguistic terms into a more general one.
In this paper, however, we give the synthesized cloud another use: summing
up two kinds of predictive knowledge. The expected value Ex of the synthesized cloud S_t
lies between the expected values of B_i and I_t, while the entropy En of S_t is the sum
of the entropies of B_i and I_t. The synthesized cloud S_t therefore covers more values than either B_i
or I_t, which means it provides more possible predictive results than either of them.
Through the synthesized cloud, the knowledge of quasi-periodical regularity and current
tendency is blended together; the two can no longer be distinguished in the cloud S_t.
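
The synthesis step and the two properties just described can be checked with a short sketch. It assumes the entropy-weighted reading of Eq. (2), in which Ex and He are averages weighted by En_1 and En_2 and En is the plain sum; the `Cloud` tuple, the `synthesize` function, and the numeric values are hypothetical.

```python
from typing import NamedTuple


class Cloud(NamedTuple):
    ex: float  # expected value Ex
    en: float  # entropy En
    he: float  # hyper-entropy He


def synthesize(b: Cloud, i: Cloud) -> Cloud:
    """Combine historical cloud b and current cloud i into one cloud.

    Ex and He are entropy-weighted averages; En is the sum of the entropies.
    """
    total_en = b.en + i.en
    if total_en <= 0:
        raise ValueError("the entropies must sum to a positive value")
    ex = (b.ex * b.en + i.ex * i.en) / total_en
    he = (b.he * b.en + i.he * i.en) / total_en
    return Cloud(ex, total_en, he)


# Made-up clouds: a broad historical cloud and a narrower current cloud.
b_i = Cloud(ex=20.0, en=2.0, he=0.2)
i_t = Cloud(ex=26.0, en=1.0, he=0.1)
s_t = synthesize(b_i, i_t)

# Ex lies between the two expected values, and En is the sum of the entropies.
assert min(b_i.ex, i_t.ex) <= s_t.ex <= max(b_i.ex, i_t.ex)
assert s_t.en == b_i.en + i_t.en
```

Because the weights are the entropies, the broader of the two clouds pulls the synthesized expected value toward itself.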

The soft inference A_i → S_t can be realized directly by connecting two cloud
generators, as described in Section 2. Unlike other predictive methods, this
mechanism can provide the predictive results in several forms instead of just a single
numeric value. First, we can take the two drops (a_t, μ) and (μ, y) on the
mathematically expected curves of the antecedent cloud A_i and the consequent
cloud S_t respectively, and provide y as the expected predictive result. Second, we can generate two
drops (a_t, μ_i) and (μ_i, y_i) randomly, and output y_i as an uncertain predictive result.
Last, we can activate the rule many times and get as many drops as we like; the
set of y_i can be provided to the user for further analysis. The predictive
results are clearly uncertain, which exactly reflects our idea of randomization and soft
inference.
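
The three output forms can be illustrated with a rough sketch of two connected generators. It assumes the standard normal-cloud formulation (a perturbed entropy En' drawn from N(En, He^2), membership exp(-(x - Ex)^2 / (2 En'^2)) for the antecedent, and y = Ex ± En' sqrt(-2 ln μ) for the consequent). Choosing the sign of the consequent drop to mirror the side of a_t relative to the antecedent's Ex is a simplifying assumption of this sketch, and the function name and numbers are hypothetical.

```python
import math
import random


def infer(antecedent, consequent, a_t, rng=None):
    """One activation of the rule antecedent -> consequent at input a_t.

    Both clouds are (Ex, En, He) tuples. With rng=None the mathematically
    expected curves are used; otherwise the entropies are perturbed by He.
    Returns the pair (mu, y).
    """
    ex_a, en_a, he_a = antecedent
    ex_s, en_s, he_s = consequent
    if rng is not None:
        # Perturbed entropies En' ~ N(En, He^2); keep En if the draw is zero.
        draw_a = abs(rng.gauss(en_a, he_a))
        draw_s = abs(rng.gauss(en_s, he_s))
        en_a = draw_a if draw_a > 0.0 else en_a
        en_s = draw_s if draw_s > 0.0 else en_s
    # Antecedent generator: drop (a_t, mu). Clamp to avoid log(0) below.
    mu = max(math.exp(-((a_t - ex_a) ** 2) / (2.0 * en_a ** 2)), 1e-300)
    # Consequent generator: drop (mu, y), on the mirrored side (assumption).
    side = 1.0 if a_t >= ex_a else -1.0
    y = ex_s + side * en_s * math.sqrt(-2.0 * math.log(mu))
    return mu, y


a_i = (30.0, 4.0, 0.4)        # made-up antecedent cloud
s_t = (22.0, 3.0, 0.5 / 3.0)  # made-up synthesized consequent cloud

_, y_expected = infer(a_i, s_t, 28.0)          # expected predictive result
rng = random.Random(0)
_, y_random = infer(a_i, s_t, 28.0, rng)       # one uncertain result
y_set = [infer(a_i, s_t, 28.0, rng)[1] for _ in range(1000)]  # many drops
```

The list `y_set` plays the role of the set of y_i handed to the user; its spread reflects both the entropy and the hyper-entropy of the two clouds.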



4 Experiment and Conclusion 

Sales prediction is of great significance in commerce, and product
sales usually show quasi-periodical regularity with a period of one year. We used the
auto sales in nominal dollars, plotted alongside the CPI, from 1989 to 1995 as our
experimental data. Five predictive linguistic rules were discovered from them. Figure
3 shows the predictive results over short-range and long-range horizons, from the 20th week
to the 52nd week of 1995, compared with the original time-series data of 1995.

[Figure 3 plot; series label: Retail sales, automotive dealers, $bn, NSA]

Fig. 3. The Forecasting Results of the Sales in 1995



This paper proposes a new mechanism of time-series prediction with cloud models.
It can discover two kinds of qualitative predictive knowledge, quasi-periodical
regularity and current tendency, which are represented respectively as predictive
linguistic rules and the current cloud, both based on cloud models. This mechanism not only
synthesizes predictive knowledge of different granularities, but also
combines two kinds of predictive strategy: local prediction and overall prediction. In future work we
intend to focus on the dimensionality of the
predictive factors and on concept drift.